Abstract



We analyze the phenomenon of collusion for the purpose of boosting the pagerank of a node
in an interlinked environment. We investigate the optimal attack pattern for a group of nodes
(attackers) attempting to improve the ranking of a specific node (the victim). We consider
attacks where the attackers can only manipulate their own outgoing links. We show that the
optimal attacks in this scenario are uncoordinated, i.e., the attackers link directly to the victim
and to no one else; in particular, the attackers do not link to each other. We also discuss optimal
attack patterns for a group that wants to hide itself by not pointing directly to the victim. In these
disguised attacks, the attackers link to nodes $\ell$ hops away from the victim. We show that an optimal
disguised attack exists and how it can be computed. The optimal disguised attack also allows
us to find optimal link farm configurations. A link farm can be considered a special case of our
approach: the target page of the link farm is the victim and the other nodes in the link farm are
the attackers for the purpose of improving the rank of the victim. The target page can, however,
control its own outgoing links for the purpose of improving its own rank, which can be modeled
as an optimal disguised attack of 1-hop on itself. Our results are unique in the literature as
we show optimality not only in the pagerank score, but also in the rank based on the pagerank
score. We further validate our results with experiments on a variety of random graph models.
Keywords: link analysis, pagerank, link spam, spam farms 

1 Introduction

Generally, a search for a particular topic on a particular search engine (such as Google) will output a 
ranked list of relevant web pages. The prominence of a page in this listing is an important indicator
of how many people will visit the page. For a commercial web site, its prominence with respect to 
product searches has important financial consequences, as does the prominence of a competitor's 
website with respect to slander about products. Prominence in rankings is prestigious, can add 
credibility to a site or a concept and can be used to make political statements [16]. For example, a 
series of attempts, called Google bombs, to improve the ranking of certain sites for a specific keyword 
were used to give weight to a specific political point of view, e.g., making the web-biography of 
the U.S. President the top hit for the term "miserable failure".¹ As a result of the importance
attached to one's pagerank, especially one's Google pagerank, artificial methods for boosting one's
pagerank are an active area for discussion. Pagerank is one of the many factors that are used in
Google's ranking algorithm [18] and a significantly high pagerank can boost the prominence of a
page considerably.

*A preliminary workshop version of this paper was presented at the First International Workshop on Adversarial
Information Retrieval on the Web (AIRWeb 05) in conjunction with the 14th International World Wide Web
Conference (WWW2005), Chiba, Japan, 10-14 May, 2005.

1 The first Google bomb was with respect to the text "talentless hack". Since then several other attacks have also
succeeded in raising the ranks of web pages with respect to specific keyword(s), in some cases using as few as 25 links.
It has been argued that several factors contributed to the success of these attacks: the number and prominence of
the attacking pages, the (un)popularity of the keyword, the use of the same keyword in all links, the higher rankings
of blogs due to the frequency of their updates, etc. Some of the keywords chosen in these attacks were very rare on
the Web at the time of the attack, e.g., "French Military Victories". However, even attacks using keywords as popular as
"Weapons of Mass Destruction" have been successful (BBC News, Sunday, 7 December, 2003).

In addition to Google bombs that were oriented towards an external site, a web-retailer could 
also make use of link manipulation to improve the prominence of its own web-site with respect to 
a particular topic(s). Link farms are a common method for boosting pageranks [10] where a set of
dummy pages are purposefully created to improve the pagerank of a specific page. However, in a 
link farm, the targeted page is controlled by the link farm as well. The link bombers or spammers 
are usually some (coordinated) set of web pages which add outgoing links to their web page. Some 
of these links will point to the attacked page, and contain the text they (the bombers) are trying 
to associate with the attacked page. The issue we address is how these bombers should organize 
their outgoing links in order to maximize the success of their link bomb in terms of pagerank score 
and rank. 

There has been discussion on whether a link bomb can be considered an "undesirable" attack [20]
that exploits a weakness in the pagerank-style algorithms [12], [15]. The pagerank algorithm assigns
you a pagerank by considering the number and importance (according to PageRank) of web pages 
that point to you. Given that a search engine like Google currently ranks over 10 billion pages, 
one would expect that a very small number of web pages should not be able to change the ranking 
of a page dramatically, contrary to what has been observed. Thus, one motivation for studying 
the optimal attack is to determine specific abnormal but effective attack patterns that could be 
identified as artificial link bombs. 

We present results on the optimal link bomb. Specifically, the attackers are a set of web pages 
whose outgoing links can be manipulated, and the victim is the target web page to be bombed. 
The victim's outgoing links cannot be manipulated. Our main result is to establish the following
theorems as a starting point for a discussion of accountability on linked structures such as the 
WWW. 

Theorem 1. The attack which maximizes the pagerank score of the victim is the direct individual 
attack. 

Theorem 2. The attack which maximizes the rank of the victim is the direct individual attack. 

Rank is the order statistic defined by the pagerank, and the direct individual attack is the attack 
in which every attacker points only to the victim and to no other page. In particular, in the optimal 
attack, none of the attackers point to each other. Thus, the optimal attack masquerades as a set of 
uncoordinated "random" nodes, all pointing to the same page. Note that both the stated theorems 
are non-trivial. An attack that maximizes pagerank score of the victim is not necessarily one that 
maximizes the pagerank rank, if the attack also raises some other node's pagerank score above the 
victim's score. To our knowledge, our result is the first result in terms of rank. 

We also discuss optimal "disguised" attack patterns, in which none of the attackers wishes to
point directly to the victim: all paths from the attackers to the victim must be of at least some
minimum length $\ell$. In this case the optimal attack is still a direct individual attack;
however, now the attackers point to some other intermediate node (not the victim).

Theorem 3. There is an optimal disguised attack in which every attacker's only link is to the same
node which is distance $\ell - 1$ from the victim.

In our work, we assume that the attackers can only control their outgoing links. They may 
not control the outgoing links of any other nodes (so the target page is outside the attacking set). 
Here, we give the extensions and complete proofs of optimality of the original results on optimal 
direct and disguised attacks, which were initially discussed in preliminary form in the workshop version of this paper. In disguised
attacks, no attacker can point to the attacked node. For malicious link bombs, it is reasonable to
assume that the target is outside the attacker set. When the attackers are trying to boost one of 
their own sites, however, the attackers can control the outgoing links of the target page. This is 
the case in link farms. The optimal configuration in a link farm therefore follows directly from our 
results: in the link farm all the attackers except the target use the direct attack to improve the 
target's rank. The target (who is now also an attacker) uses the optimal disguised attack of length 
$\ell = 1$, after the other attackers have made their direct attacks, to boost its own rank.

While the optimal attack is always the direct individual attack, the amount by which the direct 
individual attack surpasses other (more coordinated) attack patterns may depend on the nature of 
the graph. We give experimental results that quantify this phenomenon for a variety of different 
attack patterns. On certain random graph models of the Web, some coordinated attack patterns 
are almost as good as the direct individual attack, and can hence be used in place of the direct 
individual attack as a means of disguising the attack. While the effect of graph structure on the 
pagerank has been investigated in the literature [17], [12], to our knowledge, these are the first results
regarding the effect of the graph structure on the effectiveness of link bombs. 

Our results raise interesting questions such as how to detect and respond to link bomb attacks 
(in general this problem is NP-hard, see for example [22]). Since the attackers will have no visible 
associations amongst themselves, it is hard to detect and prove that they are participating in an 
attack. If the optimal attack were a tree structure, there would be a small set of nodes with high 
prominence that one might argue are "responsible" for the attack. The other nodes pointing to 
these nodes could also be held accountable for aiding and abetting the actions of the responsible nodes.
Such accountability is not possible in an individual attack. 

We proceed by first discussing the related work and giving some preliminary definitions, followed 
by a preview of our result for an isolated graph, in which the only nodes are the attackers and the 
victim. We then discuss general graphs, followed by some experimental results on a variety of 
random graph models. We conclude with a discussion of the implications of our results. (We defer
some technical proofs to an appendix.)

1.1 Related Work 

Link spam has received significant attention recently, and most of the work goes along the lines 
of quantifying the impact of different collusion strategies on pagerank [HI EJ [7J [10] . Bianchini [4] 
analyzes the impact of different community structures in the optimal energy, i.e. total pagerank 
value for a set of pages. Another line of research concentrates on the problem of modifying pagerank 
to make it resistant to such collusion strategies [14], [23], [6]. In particular, [14] concentrates on using a
set of handpicked trusted sites to bias the pagerank computation and develops methods for selecting 
seeds to be evaluated in this algorithm. Similarly, Zhang et al. [23] develop a method for stalling
the random jump probabilities to reduce the impact of colluding web pages. Caverlee et al. [6]
introduce the notion of domain or host level influence throttling to combat link spam. Drost and 
Scheffer [9] introduce machine learning algorithms to recognize spam pages, including those with 
link spam; their work considers both number of incoming and outgoing links as well as features 
related to the content. 

We highlight two of the works which are the most closely related to ours. The work by Gyongyi 
and Garcia-Molina [13] was developed independently of ours and has a similar flavor. In particular, 
they consider the case of optimal link spam structure under the assumption of constant leakage, 
which is a significant limitation. Additionally, they compute the magnitude of the attacks for 
various attack patterns. The limitation of the constant leakage was addressed by Du, Shi and Zhao 
in [10]. In particular, they consider the possibility that the attackers can have control of other 
pages in addition to the link spam farm. Du, Shi and Zhao also consider disguised attacks, when
the attacking nodes must point to non-target nodes (in addition to the possibility of pointing to 
target nodes). One difference between this existing work and ours is that it is focused on
pagerank bombing. We do provide results for pagerank bombing, but our main result is to show
that the same optimality of the direct attack holds for the rank, which is more difficult to analyze. 
However, we do not quantify algebraically the improvement in pagerank scores for different link 
farm configurations, which is shown in Du, Shi and Zhao and Gyongyi and Garcia-Molina. 

Our results for optimal disguised attacks can be converted to algorithms; however, these algorithms
require global knowledge of the graph to implement, which may be unrealistic for bounded
complexity attackers. Du, Shi and Zhao [10] also consider disguised attacks, but allow the attackers 
to point to the target but in a non-obvious way; an optimal strategy chooses nodes to point to so 
as to minimize the leakage in pagerank forced by the disguise. In general, computing any optimal 
disguised attack should involve the knowledge of the entire graph. An interesting open question 
is whether there are near-optimal disguised attacks which can be locally computed, only knowing 
some bounded in and out-neighborhood of the target and similarly some bounded in-neighborhood 
of the attackers. 



2 Preliminaries 

A search query on a set of keywords results in an ordered list of web pages $W = \{w_i\}$. Each web
page $w_i \in W$ contains some or all of the keywords either in its text or in the text of a link that
points from some other web page to $w_i$. A scoring function is used to order the pages in $W$. The
most prominent page (page with the highest score) is given rank 1, etc. 

Google [5] considers many factors in its scoring function, including: keyword frequency; relative 
locations of the keywords; the position and style of the keywords. An important factor in the scoring 
function is the pagerank which depends on how the web page is embedded in the entire graph of 
web pages. An early paper on the Google system [5] suggests that no one factor dominates the 
scoring function, however, the pagerank plays an important role. In this paper, we will concentrate 
only on the pagerank factor and discuss how it can be manipulated. 

The web graph is a directed graph $G = (V, E)$ that models the World Wide Web. The vertex set
$V$ represents the pages and documents, and the edge set $E$ represents the links between the pages
and documents.² The edges are directed: if $(v_1, v_2) \in E$, then $v_1$ contains a link to $v_2$. In a web
graph, the in-degree $\mathrm{indeg}(v)$ of page $v$ is the number of links that point to $v$ and the out-degree
$\mathrm{outdeg}(v)$ is the number of links originating from $v$ that point to other pages. A (directed) path of
length $\ell$ is a sequence of vertices $v_0, v_1, \ldots, v_\ell$ with $(v_{i-1}, v_i) \in E$ for $i = 1, \ldots, \ell$; $v_\ell$ is the terminal
node in the path, and $v_1, \ldots, v_{\ell-1}$ are intermediate nodes. We allow parallel edges between two
vertices, but no self-loops.

The pagerank $p_i$ models the probability that node $i$ will be visited either by randomly navigating
down links in the web graph or by randomly jumping to page $i$. Let $\alpha$ be the probability to navigate,
and $1 - \alpha$ the probability to jump. Then the pageranks $\{p_i\}$ of the nodes in a graph simultaneously
satisfy the set of linear equations³

$$p_i = \alpha \sum_{(v_j, v_i) \in E} \frac{p_j}{\mathrm{outdeg}(v_j)} + \frac{1 - \alpha}{N}. \qquad (1)$$

($0 < \alpha < 1$ and $N = |V|$.) The first term represents the probability to reach $i$ by random navigation.
An edge may appear multiple times if there are parallel links. The second term represents the
probability to reach $i$ by randomly jumping. Typically, $\alpha \in [0.85, 0.95]$. The pagerank $p_i$ is larger if
$v_i$ has a large in-degree, and its incoming links are from high pagerank nodes with small out-degree.

2 Note that the definition of an edge is traditionally given by hyperlinks in a web page. However, it is also possible
to count URLs in the body of a web page as links. The definition of what constitutes a link is usually application
dependent.

3 An alternative and common formulation of the pageranks in the literature is as the stationary distribution of a
suitably defined finite irreducible Markov chain with transition matrix $P = (1 - \alpha)M + \alpha U$, where $U$ is a matrix of
1's. Many of our results could be obtained by analyzing how the stationary distribution changes under perturbations
of $P$. Our approach is more graph theoretic, treating the problem as a flow.
The PageRank algorithm [18] is an iterative approach to solving these equations. The pageranks
are all initialized to a common value $p_i^0$. The PageRank iteration is given by

$$p_i^{t+1} = \alpha \sum_{(v_j, v_i) \in E} \frac{p_j^t}{\mathrm{outdeg}(v_j)} + \frac{1 - \alpha}{N}. \qquad (2)$$

$p_i^t$ converges to the (unique) solution of (1). We assume that every page can manipulate its outgoing
links, but it cannot change its incoming links.
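As a minimal sketch of the iteration just described, the following Python function applies the update for a fixed number of sweeps. The function name `pagerank`, the edge-list representation, the uniform starting value 1/N and the iteration count are assumptions made for illustration only. Nodes without outgoing links pass nothing on, so the scores need not sum to 1.

```python
def pagerank(n, edges, alpha=0.85, iters=300):
    """Fixed-point iteration for
    p_i = alpha * sum_{(j,i) in E} p_j / outdeg(j) + (1 - alpha) / n.

    `edges` is a list of (source, destination) pairs over nodes 0..n-1;
    a pair listed twice is treated as two parallel links.
    """
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    p = [1.0 / n] * n  # assumed starting point; the limit does not depend on it
    for _ in range(iters):
        nxt = [(1.0 - alpha) / n] * n          # random-jump term
        for src, dst in edges:                 # navigation term, one share per link
            nxt[dst] += alpha * p[src] / outdeg[src]
        p = nxt
    return p


# Two pages, where page 0 links to page 1.
scores = pagerank(2, [(0, 1)])
print(scores)
```

Because each sweep shrinks the remaining error by at least a factor of alpha, a few hundred sweeps are ample for the values of alpha quoted in the text.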

A link bomb, or attack, occurs when a group of attackers $A = \{v_1, \ldots, v_K\}$ alter their outgoing
links so as to boost the pagerank of a victim $v_0 \notin A$. Before the attack, if the edge set is $E$, then
after the attack the edge set will be $E'$, where the only edges added to or removed from $E$ are of the
form $(v_i, u)$ where $1 \le i \le K$ and $u \in V$, i.e., the attackers may remove and/or add outgoing
links only. After the attack, the new web graph is $G' = (V, E')$. Let $p_i$ denote the pageranks in the
original graph $G$ (before the attack), and $p'_i$ the pageranks in $G'$ (after the attack). The magnitude
of the attack $\Delta p_0 = p'_0 - p_0$ is the amount by which the pagerank of the victim increased, and is a
measure of the success of the attack. In our analysis, we only consider the magnitude of the attack,
and assume that all other factors entering into the scoring function are unchanged.



3 The Optimal Link Bomb

In this section, we investigate how to maximize the magnitude of the attack. In particular, we show
that the effectiveness of the attack does not increase if the attackers try to coordinate the attack
in some way, by introducing links among themselves in order to increase their ranks. (Recall that
incoming links from higher ranked pages are more beneficial to your rank.) First, we consider a
simplified case, in which the attackers and the victim are isolated from the rest of the graph. We
then consider the general case.

3.1 Isolated Graphs

We first restrict our attention to a graph whose vertex set is composed only of the attackers and
the victim, $V = A \cup \{v_0\}$ (i.e., $N = |V| = K + 1$). Assume (for simplicity) that $v_0$ does not point to
any member of $A$. We first consider some examples of attacks, before giving the general result. In
all cases, all the attackers in $A$ point to the victim $v_0$, and what differentiates the attacks is how
the attackers are themselves organized.

Direct Individual: The only links are to $v_0$.

Tree: The attackers form a tree. For any graph with a topological order, one can compute the
pageranks efficiently (in linear time). We will specialize to a star attack in which $v_2, \ldots, v_K$
point to $v_1$ and all attackers point to $v_0$.

Cycle: The attackers form a cycle.

Complete: The attackers form a complete graph.

[Figure: the four attack patterns on the isolated graph: Direct Individual, Tree, Cycle, Complete.]

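By solving the linear system (1) for the graph resulting from each of these attacks, we obtain

Lemma 4. For the isolated graph,

$$\begin{aligned}
p_0(\text{individual}) &= p_0 \left(1 + \alpha K\right), \\
p_0(\text{star}) &= p_0 \left(1 + \tfrac{\alpha}{2}\left(K(1 + \alpha) + 1 - \alpha\right)\right), \\
p_0(\text{cycle}) &= p_0 \left(1 + \tfrac{\alpha K}{2 - \alpha}\right), \\
p_0(\text{complete}) &= p_0 \left(1 + \tfrac{\alpha K}{K - \alpha (K - 1)}\right),
\end{aligned}$$

where $p_0 = (1 - \alpha)/(K + 1)$ is the initial pagerank of $v_0$.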

Since $0 < \alpha < 1$, after some algebra, we obtain

Theorem 5. For the isolated graph, 

po(individual) > po(star) > po(cycle) > po(complete) . 
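The ordering in Theorem 5 can be checked numerically. The sketch below builds the four isolated attack graphs, solves for the victim's pagerank by iteration, and compares the result with closed forms re-derived here from equation (1). The helper names, the choice that each attacker in the cycle links to exactly one successor, and the values K = 5 and alpha = 0.85 are illustrative assumptions rather than details taken from the paper.

```python
def pagerank(n, edges, alpha, iters=400):
    """Iterate p_i <- alpha * sum_{(j,i) in E} p_j / outdeg(j) + (1 - alpha) / n."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - alpha) / n] * n
        for src, dst in edges:
            nxt[dst] += alpha * p[src] / outdeg[src]
        p = nxt
    return p


def attack_edges(kind, K):
    """Edge list on nodes 0..K (node 0 is the victim) for one attack pattern."""
    attackers = range(1, K + 1)
    edges = [(i, 0) for i in attackers]                 # everyone points to the victim
    if kind == "star":
        edges += [(i, 1) for i in range(2, K + 1)]      # v_2..v_K also point to v_1
    elif kind == "cycle":
        edges += [(i, i % K + 1) for i in attackers]    # v_i also points to its successor
    elif kind == "complete":
        edges += [(i, j) for i in attackers for j in attackers if i != j]
    return edges


def closed_form(kind, K, alpha):
    """Victim pagerank after the attack, derived by hand from equation (1)."""
    base = (1 - alpha) / (K + 1)                        # victim pagerank before the attack
    gain = {
        "individual": alpha * K,
        "star": alpha / 2 * (K * (1 + alpha) + 1 - alpha),
        "cycle": alpha * K / (2 - alpha),
        "complete": alpha * K / (K - alpha * (K - 1)),
    }[kind]
    return base * (1 + gain)


def victim_scores(K, alpha):
    kinds = ["individual", "star", "cycle", "complete"]
    return {k: pagerank(K + 1, attack_edges(k, K), alpha)[0] for k in kinds}


K, alpha = 5, 0.85
numeric = victim_scores(K, alpha)
for kind, score in numeric.items():
    print(kind, round(score, 6), round(closed_form(kind, K, alpha), 6))
```

For K = 2 the cycle and the complete graph coincide, so the last inequality is strict only when there are at least three attackers.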

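We will show that the direct individual attack is optimal for the isolated graph. Note that
when a node has zero outdegree, it "stunts" the flow of pagerank. This means that the sum of the
pageranks need not be 1, i.e., $\{p_i\}$ need not be a probability distribution. Summing (1) over $i$, we
get

$$\sum_i p_i = \alpha \sum_i \sum_{(v_j, v_i) \in E} \frac{p_j}{\mathrm{outdeg}(v_j)} + 1 - \alpha.$$

If $\mathrm{outdeg}(v_i) > 0$, $v_i$ contributes $p_i / \mathrm{outdeg}(v_i)$ exactly $\mathrm{outdeg}(v_i)$ times to the summation for a total
contribution of $p_i$. If $\mathrm{outdeg}(v_i) = 0$, then $v_i$ does not contribute to the summation, so we obtain

$$\begin{aligned}
\sum_i p_i &= \alpha \sum_{i:\, \mathrm{outdeg}(v_i) > 0} p_i + 1 - \alpha \\
&= \alpha \sum_i p_i + 1 - \alpha - \alpha \sum_{i:\, \mathrm{outdeg}(v_i) = 0} p_i.
\end{aligned}$$

After rearranging terms and solving for $\sum_i p_i$, we obtain the following useful lemma.

Lemma 6. $\sum_i p_i = 1 - \frac{\alpha}{1 - \alpha} \sum_{j:\, \mathrm{outdeg}(v_j) = 0} p_j$.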

This lemma is useful for proving the next theorem; though it is a special case of the general 
result in the next section, it is illustrative and the proof gives an intuition for the general case. 
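The identity in Lemma 6 is easy to confirm on a small example. The sketch below computes both sides for a five-node graph with two nodes that have no outgoing links. The graph, the function names and the value of alpha are arbitrary choices made for illustration.

```python
def pagerank(n, edges, alpha=0.85, iters=300):
    """Iterate p_i <- alpha * sum_{(j,i) in E} p_j / outdeg(j) + (1 - alpha) / n."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - alpha) / n] * n
        for src, dst in edges:
            nxt[dst] += alpha * p[src] / outdeg[src]
        p = nxt
    return p


def sum_identity(n, edges, alpha=0.85):
    """Return (sum of all pageranks, 1 - alpha/(1-alpha) * pagerank held by sink nodes)."""
    p = pagerank(n, edges, alpha)
    has_out = {src for src, _ in edges}
    sink_mass = sum(p[i] for i in range(n) if i not in has_out)
    return sum(p), 1 - alpha / (1 - alpha) * sink_mass


# Nodes 3 and 4 have no outgoing links, so some pagerank is lost at them.
lhs, rhs = sum_identity(5, [(0, 1), (1, 2), (2, 0), (2, 3), (0, 4)])
print(lhs, rhs)
```

The two printed numbers should agree up to rounding, and both fall below 1 because the sink nodes hold a positive share of pagerank.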

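Theorem 7. For an isolated graph, the individual attack uniquely maximizes $p_0$.

Proof. Since $\sum_{i=1}^{K} p_i = \sum_{i=0}^{K} p_i - p_0$, Lemma 6 gives

$$\sum_{i=1}^{K} p_i = 1 - \frac{\alpha}{1 - \alpha} \sum_{i:\, \mathrm{outdeg}(v_i) = 0} p_i - p_0 \le 1 - \frac{p_0}{1 - \alpha},$$

with equality iff $v_0$ is the only vertex with out-degree 0. For an arbitrary attack,

$$\begin{aligned}
p_0 &= \alpha \sum_{(v_i, v_0) \in E} \frac{p_i}{\mathrm{outdeg}(v_i)} + \frac{1 - \alpha}{K + 1} \\
&\overset{(a)}{\le} \alpha \sum_{(v_i, v_0) \in E} p_i + \frac{1 - \alpha}{K + 1} \\
&\overset{(b)}{\le} \alpha \sum_{i=1}^{K} p_i + \frac{1 - \alpha}{K + 1} \\
&\le \alpha \left(1 - \frac{p_0}{1 - \alpha}\right) + \frac{1 - \alpha}{K + 1}.
\end{aligned}$$

Solving for $p_0$, we obtain that

$$p_0 \le (1 - \alpha)\left(\alpha + \frac{1 - \alpha}{K + 1}\right) = \frac{1 - \alpha}{K + 1}\,(1 + \alpha K).$$

This bound is attained by the individual attack. Uniqueness follows because equality in (a) occurs
iff $\mathrm{outdeg}(v_i) = 1$ whenever $(v_i, v_0) \in E$, and equality in (b) occurs iff every edge $(v_i, v_0)$ is in $E$.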
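For very small K, the claim of Theorem 7 can also be confirmed by exhaustive search. The sketch below enumerates every choice of outgoing links for three attackers on the isolated graph (without parallel links, which is a simplification made here) and reports which configurations give the victim the highest pagerank. All names are hypothetical helpers written for this illustration.

```python
from itertools import combinations, product


def pagerank(n, edges, alpha=0.85, iters=200):
    """Iterate p_i <- alpha * sum_{(j,i) in E} p_j / outdeg(j) + (1 - alpha) / n."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - alpha) / n] * n
        for src, dst in edges:
            nxt[dst] += alpha * p[src] / outdeg[src]
        p = nxt
    return p


def link_choices(node, n):
    """Every set of distinct targets that `node` may link to (no self-loops)."""
    others = [u for u in range(n) if u != node]
    return [c for r in range(len(others) + 1) for c in combinations(others, r)]


def best_attacks(K, alpha=0.85):
    """Score every attack on the isolated graph; node 0 is the victim and has no out-links."""
    n = K + 1
    scored = []
    for config in product(*(link_choices(a, n) for a in range(1, n))):
        edges = [(a, u) for a, targets in enumerate(config, start=1) for u in targets]
        scored.append((pagerank(n, edges, alpha)[0], config))
    top = max(score for score, _ in scored)
    winners = [config for score, config in scored if top - score < 1e-12]
    return top, winners


top, winners = best_attacks(3)
print(top, winners)
```

Each entry of a configuration is the tuple of targets chosen by one attacker, so a winner of the form `((0,), (0,), (0,))` is the direct individual attack.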



3.2 Arbitrary Graphs 

When $v_0, \ldots, v_K$ are embedded in a larger graph $G$, the direct individual attack is still optimal.
Intuitively, one can view the PageRank iteration (2) as sending a flow of pagerank down the directed
edges. The maximum flow from $v_i$ to $v_0$ occurs when $v_i$ points directly to $v_0$, and to no other node;
any other links divert the flow and lead to a lower magnitude attack. The following results will
make this intuition more formal. We will generally refer to nodes which are neither the attackers
nor the victim by $w_j$, and $u_j$ will be used to refer to any node. The 1-neighborhood $N_1(v)$ of a
node $v$ is the set of nodes to which $v$ points. $N_k(v)$ ($k > 1$) is the set of $k$-neighborhood nodes:
$u \in N_k(v)$ iff for some $w \in N_{k-1}(v)$, $(w, u) \in E$. Note that $v$ could be in its own $k$-neighborhood
for $k > 1$, and $N_0(v) = \{v\}$. In this section, many of the proofs are involved, and so we will sketch
the intuition and defer the technical proofs to the appendix.

Consider attacker $v_i$, and, without loss of generality, assume it initially has no outgoing links.
Suppose now that it adds $\delta$ outgoing edges. This results in $1/\delta$ of its rank "flowing" along each of its
edges to its neighbors (note there may be parallel links). Thus, the rank increase for a 1-neighbor
$u_j$ is given by

$$\Delta_j^1 = \alpha \sum_{(v_i, u_j)} \frac{p_i}{\delta},$$

where the superscript 1 indicates that $u_j$ is a 1-neighbor, and $j$ is an index that enumerates the
1-neighbors. The sum is over all parallel edges that $v_i$ may have to $u_j$. This increase in rank in turn
propagates to 2-neighbors, resulting in an increase in the rank of a 2-neighbor $u_k$ by an amount

$$\Delta_k^2 = \alpha \sum_{j \text{ s.t. } u_j \in N_1(v_i),\ (u_j, u_k) \in E} \frac{\Delta_j^1}{\mathrm{outdeg}(u_j)}.$$

The sum is over all 1-neighbors pointing to $u_k$ (including parallel edges). If the newly added edges
create a path from $v_i$ to $v_0$, then some amount of $v_i$'s pagerank will propagate to $v_0$. We define
$\Delta_j^\ell$ to be the change in the page rank of $u_j$ from flow down all paths of length $\ell$ from $v_i$ to $u_j$,

$$\Delta_j^\ell = \alpha \sum_{k \text{ s.t. } u_k \in N_{\ell-1}(v_i),\ (u_k, u_j) \in E} \frac{\Delta_k^{\ell-1}}{\mathrm{outdeg}(u_k)}.$$

Let $\delta(\ell)$ be the total increase in page rank through paths of length $\ell$, $\delta(\ell) = \sum_j \Delta_j^\ell$. Since the
pagerank increase attenuates by a factor $\alpha$ with each edge, we have the following lemma.



Lemma 8. For $\ell \ge 1$,

$$\delta(\ell) \le \alpha^\ell p_i,$$

with equality iff $\delta(\ell - 1) = \alpha^{\ell - 1} p_i$ and for every $u_k \in N_{\ell-1}(v_i)$, $\mathrm{outdeg}(u_k) > 0$.

Proof. See Section 4.1.1. ■
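The attenuation bound of Lemma 8 can be illustrated by pushing an attacker's pagerank along its links one step at a time. In the sketch below, `flow_totals` is a hypothetical helper that returns the total flow through paths of each length; the four-node graph, the attacker's pagerank of 1.0 and alpha = 0.5 are arbitrary values chosen so that the arithmetic is exact.

```python
def flow_totals(n, edges, attacker, p_attacker, alpha, max_len):
    """Return [delta(1), ..., delta(max_len)]: the total pagerank that the
    attacker pushes along paths of each length."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    flow = [0.0] * n
    flow[attacker] = p_attacker          # length 0: the attacker's own pagerank
    totals = []
    for _ in range(max_len):
        nxt = [0.0] * n
        for src, dst in edges:           # each link carries an equal share, damped by alpha
            nxt[dst] += alpha * flow[src] / outdeg[src]
        flow = nxt
        totals.append(sum(flow))
    return totals


# Attacker 0 links to nodes 1 and 2; node 1 links on to node 3; node 2 has no out-links.
alpha, p_attacker = 0.5, 1.0
totals = flow_totals(4, [(0, 1), (0, 2), (1, 3)], 0, p_attacker, alpha, 3)
bounds = [alpha ** length * p_attacker for length in (1, 2, 3)]
print(totals)
print(bounds)
```

The bound is met with equality at length 1. At length 2 half of the flow is lost because node 2 has no outgoing links, which matches the equality condition stated in the lemma.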

Let $S$ be a set of nodes. A path $q$ passes through $S$ if some node of $S$ is an intermediate node
of $q$. A set of paths $P$ passes through $S$ if every path in $P$ passes through $S$. Let $P_t$ be a collection
of paths that passes through $S$, with every path in $P_t$ having the same terminal node $t \ne v_i$ ($t$ is
not an intermediate node of any path in $P_t$). We call $t$ a progeny of $S$ with respect to the paths
$P_t$. Since every path passes through $S$, some prefix of every path in $P_t$ has a terminal node in $S$.
For each path $q \in P_t$, let $q_S$ be a (any) prefix with terminal node in $S$, and let $P_t(S)$ denote the
collection of such distinct prefixes $\{q_S\}$.

The influence $I(S \mid P_t(S))$ of $v_i$ on $S$ is the total flow of pagerank (summed over all nodes in $S$)
from $v_i$ to $S$ along the paths in $P_t(S)$ (which are (distinct) prefixes of paths in $P_t$). The influence $I(t \mid P_t)$
of $v_i$ on $t$ is the total flow of pagerank that flows to $t$ along the paths in $P_t$ (which pass through
$S$). Every path in $P_t$ has at least one additional edge compared with its corresponding prefix that
terminates in $S$, so the influence that propagates to $t$ along $P_t$ can be at most the influence that
propagates to $S$ along the paths in $P_t(S)$, attenuated by a factor $\alpha$. We have the following lemma.

Lemma 9. Let $P_t$ be a collection of paths from $v_i$ to $t$ which passes through a set of nodes $S$
($t$ appears only as a terminal node in $P_t$). Let $P_t(S)$ be a (any) collection of distinct prefixes
terminating in $S$. Then

$$I(t \mid P_t) \le \alpha\, I(S \mid P_t(S)),$$

independent of which prefixes are used in the construction of $P_t(S)$.

Proof. See Section 4.1.2. ■

We now consider $v_i$'s attack on $v_0$. Let $P$ denote the collection of all (distinct) paths from $v_i$ to
$v_0$ in which $v_0$ appears only as the terminal node, i.e., $v_0$ is not an intermediate node of any path
in $P$. Note that if there are cycles in the graph, then $P$ may contain an infinite number of paths.
Let the flow of pagerank from $v_i$ to $v_0$ down the paths in $P$ be denoted $\Delta$. There may be cycles
containing $v_0$, in which case the pagerank increase $\Delta$ will continue to flow around these cycles,
back to $v_0$, increasing the pagerank further, i.e., $\Delta$ will be amplified by the cycles. Let $\Delta p_0^i$ be $v_i$'s
contribution to the magnitude of the attack,

$$\Delta p_0^i(\Delta) = \Delta + \mathrm{amp}(\Delta),$$

where $\mathrm{amp}(\Delta)$ is the amplification due to the cycles that contain $v_0$. The larger $\Delta$, the larger will
be the amplification of $\Delta$.

Lemma 10. $\Delta p_0^i(\Delta)$ is a monotonically increasing function of $\Delta$.

Proof. See Section 4.1.3. ■



Lemmas 8, 9 and 10 are the main tools we will need to prove our main result, namely that the
individual attack is optimal with respect to pagerank. By Lemma 10, since $\Delta p_0^i$ is monotonically
increasing in $\Delta$, $\Delta p_0^i$ will be maximized when $\Delta$ is maximized. $\Delta$ is given by the sum of the flows
of pagerank from $v_i$ to $v_0$ along the paths in $P$, therefore we only need to consider this flow.

Let $\ell$ be the length of the shortest path in $P$ (there may be many such shortest paths). Consider
the set $L$ of all distinct paths of length $\ell$ originating at $v_i$. Some of these paths have terminal node
$v_0$. We now restrict our attention to the set $L'$ containing those paths in $L$ which do not have
terminal node $v_0$. Note that none of the paths in $L'$ can have $v_0$ as an intermediate node since the
shortest path from $v_i$ to $v_0$ has length $\ell$. Let $S$ denote the set of terminal nodes in $L'$. Partition
$P$ into two disjoint sets, $P_\ell$ and $P_{>\ell}$, where $P_\ell$ contains the paths in $P$ with length $\ell$ and $P_{>\ell}$
the paths with length $> \ell$. Every path in $P_{>\ell}$ must pass through at least one of the nodes in $S$,
therefore $P_{>\ell}$ passes through $S$. Every path in $P_{>\ell}$ has terminal node $v_0$, and $v_0$ does not appear
as an intermediate node in any of these paths. Thus, $v_0$ is a progeny of $S$ with respect to $P_{>\ell}$.
Every path in $P_{>\ell}$ has a prefix of length $\ell$ with terminal node in $S$. Collect these distinct prefixes
into the set $P_{>\ell}(S)$.

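Let $\Delta_\ell$ be the contribution to $\Delta$ due to flow along the paths in $P_\ell$, and $\Delta_{>\ell}$ the contribution
due to flow along the paths in $P_{>\ell}$. Then,

$$\begin{aligned}
\Delta &= \Delta_\ell + \Delta_{>\ell} \\
&\overset{(a)}{=} \Delta_\ell + I(v_0 \mid P_{>\ell}) \\
&\overset{(b)}{\le} \Delta_\ell + \alpha\, I(S \mid P_{>\ell}(S)) \\
&\overset{(c)}{\le} \Delta_\ell + I(S \mid P_{>\ell}(S)) \\
&\overset{(d)}{\le} \Delta_\ell + \sum_{s \in S} \Delta_s^\ell \\
&\overset{(e)}{\le} \delta(\ell) \\
&\overset{(f)}{\le} \alpha^\ell p_i.
\end{aligned}$$

(a) follows from the definitions of $\Delta_{>\ell}$ and influence; (b) follows from Lemma 9 and (c) because
$\alpha < 1$. (d) follows because the paths in $P_{>\ell}(S)$ are all of length $\ell$, so $P_{>\ell}(S)$ is a subset of all the
paths of length $\ell$ that terminate in $S$; (e) follows from the definition of $\delta(\ell)$, since $S \cup \{v_0\} \subseteq N_\ell(v_i)$;
finally, (f) is an application of Lemma 8. Equality occurs iff $S$ is empty, and all paths from $v_i$
are of length $\ell$, ending at $v_0$. Certainly, the optimal value of $\ell$ is 1, and so we have the following
theorem.⁴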

Theorem 11. $\Delta p_0^i$ is maximized if and only if the only edge from $v_i$ is to $v_0$. This is independent
of all the other edges in the graph, in particular independent of the edges from the other $v_j$.

Theorem 11 directly implies the following result.

Corollary 12. The direct individual attack is optimal for maximizing the pagerank $p_0$.

Though the direct individual attack maximizes the pagerank of $v_0$, it is not obvious that this
also maximizes the rank of $v_0$, which depends on the relative pageranks. Is it possible that some
other attack, though it will increase $p_0$ less, might increase it more relative to some other node
and hence improve $v_0$'s rank more? The answer is no, i.e., the direct individual attack also
maximizes the rank (as opposed to the pagerank) of the victim.

4 An alternative proof of this theorem using the Markov chain approach can be given using a generalization of the
result in [8], where it is shown that adding the edge can only increase the pagerank of j.

Suppose that some other attack $X$ maximizes the rank of $v_0$. This means that for some node
$u$, $p_0^I < p_u^I$ and $p_0^X > p_u^X$ ($I$ denotes the direct individual attack). We show that such a situation
can never occur, leading to the following result.

Theorem 13. The direct individual attack maximizes the rank of $v_0$.

Proof. See Section 4.1.4. ■
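Corollary 12 and Theorem 13 can be probed empirically on a small random graph. The sketch below fixes the links of all non-attackers, then compares the direct individual attack against randomly generated alternative link choices for three attackers. The graph size, the number of links per node, the seed and all function names are assumptions made for this illustration, and a random search of this kind is only a sanity check, not a proof.

```python
import random


def pagerank(n, edges, alpha=0.85, iters=200):
    """Iterate p_i <- alpha * sum_{(j,i) in E} p_j / outdeg(j) + (1 - alpha) / n."""
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    p = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - alpha) / n] * n
        for src, dst in edges:
            nxt[dst] += alpha * p[src] / outdeg[src]
        p = nxt
    return p


def rank_of(p, node):
    """Rank 1 is the highest score: one plus the number of strictly larger scores."""
    return 1 + sum(1 for q in p if q > p[node])


def compare_attacks(n=20, alpha=0.85, trials=100, seed=0):
    """Return pageranks under the direct attack and under `trials` random alternatives."""
    rng = random.Random(seed)                 # fixed seed, purely for repeatability
    victim, attackers = 0, [1, 2, 3]
    # Links the attackers cannot touch: three random out-links per non-attacker.
    fixed = [(j, rng.choice([u for u in range(n) if u != j]))
             for j in range(n) if j not in attackers for _ in range(3)]
    direct = pagerank(n, fixed + [(a, victim) for a in attackers], alpha)
    alternatives = []
    for _ in range(trials):
        # Each attacker links to between 0 and 4 randomly chosen pages.
        links = [(a, u) for a in attackers
                 for u in rng.sample([u for u in range(n) if u != a], rng.randint(0, 4))]
        alternatives.append(pagerank(n, fixed + links, alpha))
    return direct, alternatives


direct, alternatives = compare_attacks()
print("score:", direct[0], "best alternative:", max(p[0] for p in alternatives))
print("rank: ", rank_of(direct, 0), "best alternative:", min(rank_of(p, 0) for p in alternatives))
```

Corollary 12 implies that the first score printed should never be smaller than the second; Theorem 13 makes the analogous claim for the rank, where a smaller number is better.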



3.3 The Optimal Disguised Attack 

We now consider the situation in which the attackers wish to maximize the magnitude of their 
attack on vq, but they wish to disguise the attack by not pointing directly to the victim. In such 
an attack, the anchor text will not be associated to the victim, hence we assume that the victim 
already has a high prominence with respect to the anchor text. The specific disguise constraint 
we consider is that for every attacker, the shortest path to the victim should have length at least
$\ell > 1$.

Consider attacker $v_i$. In any attack, some amount of pagerank flows from $v_i$ to $v_0$. In any
directed graph, we define $f(u; v)$, the forward value of vertex $u$ with respect to vertex $v$, to be the
fraction of $u$'s pagerank that flows to $v$ along paths with $v$ as terminal node but not as intermediate
node. Thus, for example, $f(v; v) = 1$. Since the fraction of $u$'s rank that makes it to $v$ can be
obtained by multiplying the fraction flowing to each neighbor with the fraction flowing from that
neighbor to $v$, we obtain the forward equation for the forward values $f(u; v)$:

$$f(v; v) = 1, \qquad f(u; v) = \frac{\alpha}{\mathrm{outdeg}(u)} \sum_{(u, w) \in E} f(w; v) \quad (u \ne v). \qquad (3)$$

The forward equation (3) is similar to the pagerank equation (1) and can be solved by a similar
iterative algorithm as in (2).
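A minimal sketch of such an iterative solver for the forward values follows. It assumes that each link passes on an equal share of the flow damped by alpha, in line with the flow picture used for the pagerank equation; that reading, the function name `forward_values` and the fixed iteration count are assumptions of this illustration.

```python
def forward_values(n, edges, target, alpha=0.85, iters=200):
    """Iteratively solve f(target) = 1 and, for u != target,
    f(u) = alpha / outdeg(u) * sum of f(w) over links (u, w).

    Nodes without outgoing links (other than the target) get forward value 0.
    """
    out = [[] for _ in range(n)]
    for u, w in edges:
        out[u].append(w)                 # parallel links appear more than once
    f = [0.0] * n
    f[target] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        for u in range(n):
            if u == target:
                nxt[u] = 1.0             # flow stops once it reaches the target
            elif out[u]:
                nxt[u] = alpha * sum(f[w] for w in out[u]) / len(out[u])
        f = nxt
    return f


# A chain 0 -> 1 -> 2 with target 2, and the same chain with a shortcut 0 -> 2.
chain = forward_values(3, [(0, 1), (1, 2)], target=2, alpha=0.5)
shortcut = forward_values(3, [(0, 1), (0, 2), (1, 2)], target=2, alpha=0.5)
print(chain, shortcut)
```

In the chain, each extra hop multiplies the forward value by alpha; with the shortcut, node 0 splits its flow between a one-hop and a two-hop route.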

For every vertex it (not an attacker), we consider the edge set E u = E U (i>j,it), which defines 
a new directed graph in which the edge set is augmented by a single link from the attacker to it. 
For this graph, we can compute the forward value f u (w; Vq) of any vertex w with respect to v<j. We 
define the value Vi (it) of vertex u to attacker Vi by 

Vi(u) = fu(vi;v ). 

By Lemma 10, the optimal attack is the one that maximizes the flow of pagerank to $v_0$, which means
that $v_i$ should point to the node $u$ satisfying the "disguise constraints" that maximizes $V_i(u)$. There
may be many optimal attacks, but we will now show that there exists an optimal attack for $v_i$ which
consists of adding a single link to a vertex $u$ at distance $\ell-1$ from $v_0$ that maximizes $V_i(u)$.
Let $d(u,v)$ be the length of the shortest path from $u$ to $v$; if no path exists from $u$ to $v$,
set $d(u,v) = \infty$. Let $U_\ell(v_0)$ be the collection of nodes which have a path of length $\ell$ to $v_0$ and no
shorter path to $v_0$. Thus,

$$U_\ell(v_0) = \{u : d(u,v_0) = \ell\}.$$
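The level sets $U_\ell(v_0)$ can be computed with a breadth-first search over reversed edges. The sketch below is only an illustration; the names `distances_to` and `level_set` and the dictionary-of-lists graph format are assumptions of the example, not notation from the paper.

```python
from collections import deque


def distances_to(out_links, target):
    """Shortest-path length d(u, target) for every node that can reach target.

    Nodes missing from the returned dict have d(u, target) = infinity.
    """
    reverse = {u: [] for u in out_links}
    for u, succ in out_links.items():
        for w in succ:
            reverse[w].append(u)
    dist = {target: 0}
    queue = deque([target])
    while queue:
        x = queue.popleft()
        for u in reverse[x]:
            if u not in dist:  # first visit gives the shortest distance
                dist[u] = dist[x] + 1
                queue.append(u)
    return dist


def level_set(out_links, target, ell):
    """U_ell(target): nodes whose shortest path to target has length exactly ell."""
    dist = distances_to(out_links, target)
    return {u for u, d in dist.items() if d == ell}


# Example: a -> t, b -> a, c -> {b, t}; node d cannot reach t.
graph = {"t": [], "a": ["t"], "b": ["a"], "c": ["b", "t"], "d": []}
u1 = level_set(graph, "t", 1)
u2 = level_set(graph, "t", 2)
```

Here `u1` contains `a` and `c`, `u2` contains only `b`, and `d` belongs to no level set because it has no path to the target.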

Suppose that the disguise constraint (which we apply to all the attackers) is that the shortest path
from an attacker to $v_0$ must have length at least $\ell$. Let $U_{\ell-1} = U_{\ell-1}(v_0)$ be the nodes whose
shortest path to $v_0$ has length $\ell-1$. First we show that the maximum value of $V_i(u)$ is attained for
some node in $U_{\ell-1}$.

Lemma 14. $\max_{u:\,d(u,v_0)\ge \ell-1} V_i(u) = \max_{u\in U_{\ell-1}} V_i(u)$.

Proof. See Section 4.2.1. ■



Lemma 14 implies that we only need to consider nodes at distance $\ell-1$ from $v_0$ in determining
which intermediate node to attack. Note that for each $u \in U_{\ell-1}$, in order to compute $V_i(u)$, we need
to compute $f_u(v_i;v_0)$, which may require the computation of $f_u(v;v_0)$ for all $v \in V$. By following
arguments similar to those that led to Theorem 11, we find that the optimal attack for $v_i$ is to
point only to the vertex $u$ that maximizes $V_i(u)$.

Theorem 15. The optimal disguised attack for a single attacker $v_i$ is a single link to the vertex $u$,
at distance $\ell-1$ from $v_0$, which maximizes $V_i(u)$.

Proof. See Section 4.2.2. ■

Note that the vertex that maximizes $V_i(u)$ may not be unique; however, by Lemma 14 we know
that at least one such vertex exists in $U_{\ell-1}$.
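A single attacker's search over $U_{\ell-1}$ can be sketched as below: for each candidate $u$, add the link $(v_i,u)$, solve the forward equation, and read off $V_i(u) = f_u(v_i;v_0)$. This is a hedged illustration under assumed conventions (dictionary-of-lists graph, helper names `forward_values`, `distances_to`, `best_disguised_target`); it is not taken from the paper.

```python
from collections import deque


def forward_values(out_links, target, alpha, tol=1e-12, max_iter=10000):
    """Solve f(target) = 1, f(u) = alpha / outdeg(u) * sum f(w) by iteration."""
    f = {u: 0.0 for u in out_links}
    f[target] = 1.0
    for _ in range(max_iter):
        change = 0.0
        for u, succ in out_links.items():
            if u == target or not succ:
                continue
            new = alpha * sum(f[w] for w in succ) / len(succ)
            change = max(change, abs(new - f[u]))
            f[u] = new
        if change < tol:
            break
    return f


def distances_to(out_links, target):
    """Reverse breadth-first search giving d(u, target) for reachable nodes."""
    reverse = {u: [] for u in out_links}
    for u, succ in out_links.items():
        for w in succ:
            reverse[w].append(u)
    dist, queue = {target: 0}, deque([target])
    while queue:
        x = queue.popleft()
        for u in reverse[x]:
            if u not in dist:
                dist[u] = dist[x] + 1
                queue.append(u)
    return dist


def best_disguised_target(out_links, attacker, victim, ell, alpha=0.85):
    """Return (u*, values) where values[u] = V_i(u) for each u at distance ell - 1."""
    dist = distances_to(out_links, victim)
    candidates = [u for u, d in dist.items() if d == ell - 1 and u != attacker]
    values = {}
    for u in candidates:
        augmented = {x: list(s) for x, s in out_links.items()}
        augmented[attacker].append(u)  # edge set E_u = E plus the link (v_i, u)
        values[u] = forward_values(augmented, victim, alpha)[attacker]
    best = max(values, key=values.get) if values else None
    return best, values


# Attacker "v" has no links yet; a -> t, b -> {t, c}, and c <-> d never reach t.
graph = {"t": [], "a": ["t"], "b": ["t", "c"], "c": ["d"], "d": ["c"], "v": []}
best, values = best_disguised_target(graph, "v", "t", ell=2, alpha=0.5)
```

In this example $V(a) = \alpha^2 = 0.25$ and $V(b) = \alpha \cdot \alpha/2 = 0.125$, so the attacker should link to `a`, the candidate that forwards all of its rank to the victim.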

Unfortunately, the node maximizing $V_i(u)$ need not be the same for different attackers: the
disguise constraint introduces dependencies between attackers, i.e., the optimal attack for a
particular attacker may depend on what the other attackers do. In particular, it is no longer the case
that each attacker using its optimal disguised individual attack will maximize the magnitude of the
disguised attack if the group of attackers act jointly. The following example with two attackers and
$\ell = 1$ illustrates the issue.




[Figure: (a) Optimal individual attacks. (b) Optimal joint attack.]



The optimal attack for $v_1$ is to point to $u$, and for $v_2$ it is to point to $w$ (red dotted arrows in
(a)). However, if both attackers attack, then they should both point to $u$. Theorem 15 applies to
attacker $v_i$ independently of what the other attackers do. In particular, we conclude that in the
optimal joint attack, every attacker has a single link to a node in $U_{\ell-1}$. In fact, there is an optimal
attack in which every attacker links to the same node in $U_{\ell-1}$.

Theorem 16. There is an optimal joint attack in which every attacker points to the same node in
$U_{\ell-1}$.

Proof. See Section 4.2.3. ■

Theorem 16 ensures that an efficient algorithm to compute an optimal joint attack is to select
the best attack among all the attacks in which the attackers all link to a single node in $U_{\ell-1}$ (there
are at most $O(|V|)$ such attacks).
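The enumeration over single-node joint attacks can be sketched as follows. Several choices here are assumptions made for the illustration: the magnitude of an attack is measured by the victim's pagerank, pagerank is computed with a plain damped power iteration with uniform teleportation and no redistribution of rank from dangling nodes, and the names `pagerank`, `distances_to` and `best_joint_target` are hypothetical.

```python
from collections import deque


def pagerank(out_links, alpha=0.85, iters=200):
    """Damped power iteration: p(w) = (1 - alpha)/n + alpha * sum p(u)/outdeg(u)."""
    n = len(out_links)
    p = {u: 1.0 / n for u in out_links}
    for _ in range(iters):
        nxt = {u: (1.0 - alpha) / n for u in out_links}
        for u, succ in out_links.items():
            for w in succ:
                nxt[w] += alpha * p[u] / len(succ)
        p = nxt
    return p


def distances_to(out_links, target):
    """Reverse breadth-first search giving d(u, target) for reachable nodes."""
    reverse = {u: [] for u in out_links}
    for u, succ in out_links.items():
        for w in succ:
            reverse[w].append(u)
    dist, queue = {target: 0}, deque([target])
    while queue:
        x = queue.popleft()
        for u in reverse[x]:
            if u not in dist:
                dist[u] = dist[x] + 1
                queue.append(u)
    return dist


def best_joint_target(out_links, attackers, victim, ell, alpha=0.85):
    """Try every node u at distance ell - 1; all attackers link only to u."""
    base = {x: ([] if x in attackers else list(s)) for x, s in out_links.items()}
    dist = distances_to(base, victim)
    candidates = [u for u, d in dist.items() if d == ell - 1 and u not in attackers]
    best, best_score = None, float("-inf")
    for u in candidates:  # at most |V| candidate attacks
        attacked = {x: list(s) for x, s in base.items()}
        for a in attackers:
            attacked[a] = [u]
        score = pagerank(attacked, alpha)[victim]
        if score > best_score:
            best, best_score = u, score
    return best, best_score


graph = {"t": [], "a": ["t"], "b": ["t", "c"], "c": ["d"], "d": ["c"],
         "v1": [], "v2": []}
best, score = best_joint_target(graph, {"v1", "v2"}, "t", ell=2, alpha=0.5)
```

Each candidate costs one pagerank computation, so the whole search needs at most $|V|$ of them. In the example both attackers should point to `a`, which forwards all of its rank to the victim, rather than to `b`, which leaks half of it.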



4 Proofs 

For our proofs, we will need some standardized notation for discussing sets of paths, and the flow of
pagerank along these paths. A collection of paths $P(w_1w_2; x_1x_2\cdots x_k)$ contains all paths from $w_1$ to
$w_2$ which do not contain the nodes $x_1,\ldots,x_k$ as intermediate nodes. The fraction of $w_1$'s pagerank
that flows to $w_2$ along the paths in $P(w_1w_2; x_1x_2\cdots x_k)$ will be denoted $\rho(w_1w_2; x_1x_2\cdots x_k)$. Since
only positive flow flows along paths, we have the following useful lemma.

Lemma 17. If $S_1 \subseteq S_2$ are two sets of nodes, then $\rho(w_1w_2; S_1) \ge \rho(w_1w_2; S_2)$.

Consider cycles originating at a node $w$ and not containing $w$ as an intermediate node. Suppose
that a fraction $\gamma$ of $w$'s pagerank flows along these cycles back to $w$. Since this fraction can
also flow back to $w$ along the same cycles (attenuated by an additional factor $\gamma$), by summing the
resulting geometric series, we obtain the following useful lemma.

Lemma 18. Consider a node $w$ and a set of nodes $S$ with $w \in S$. Let $\gamma = \rho(ww;S)$. Then
the fraction of $w$'s pagerank that flows back to $w$ through repeated use of the cycles in $P(ww;S)$ is
$\frac{1}{1-\gamma}$.

4.1 Arbitrary Graphs 
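4.1.1 Proof of Lemma 8

We prove the lemma by induction on $\ell$. When $\ell = 1$, if $\mathrm{outdeg}(v_i) = 0$ then $\delta(1) = 0 < \alpha p_i$. If
$\mathrm{outdeg}(v_i) > 0$, then

$$\delta(1) = \sum_{u_j}\sum_{(v_i,u_j)\in E} \frac{\alpha p_i}{\mathrm{outdeg}(v_i)}
= \frac{\alpha p_i}{\mathrm{outdeg}(v_i)} \sum_{u_j}\sum_{(v_i,u_j)\in E} 1
\overset{(a)}{=} \frac{\alpha p_i}{\mathrm{outdeg}(v_i)}\,\mathrm{outdeg}(v_i) = \alpha p_i.$$

(a) follows because $\sum_{u_j}\sum_{(v_i,u_j)\in E} 1 = \mathrm{outdeg}(v_i)$. Thus, $\delta(1) \le \alpha p_i$. Suppose that $\delta(L) \le \alpha^L p_i$
and consider $\ell = L + 1$.

$$\begin{aligned}
\delta(L+1) &= \sum_{u_j \in N_{L+1}(v_i)} \Delta_j^{L+1} \\
&= \sum_{u_j}\ \sum_{\substack{(u_k,u_j)\in E \\ \text{s.t. } u_k \in N_L(v_i)}} \frac{\alpha\,\Delta_k^{L}}{\mathrm{outdeg}(u_k)} \\
&= \alpha \sum_{\substack{u_k \in N_L(v_i) \\ \text{s.t. } \mathrm{outdeg}(u_k) > 0}} \frac{\Delta_k^{L}}{\mathrm{outdeg}(u_k)} \sum_{u_j}\sum_{(u_k,u_j)\in E} 1 \\
&\overset{(a)}{=} \alpha \sum_{\substack{u_k \in N_L(v_i) \\ \text{s.t. } \mathrm{outdeg}(u_k) > 0}} \Delta_k^{L} \\
&\overset{(b)}{\le} \alpha \sum_{u_k \in N_L(v_i)} \Delta_k^{L} \\
&\overset{(c)}{=} \alpha\,\delta(L) \\
&\overset{(d)}{\le} \alpha^{L+1} p_i.
\end{aligned}$$

(a) follows because $\sum_{u_j}\sum_{(u_k,u_j)\in E} 1 = \mathrm{outdeg}(u_k)$. Equality in (b) occurs only if all nodes $u_k \in
N_L(v_i)$ have $\mathrm{outdeg}(u_k) > 0$. (c) follows from the definition of $\delta(L)$, and (d) from the induction
hypothesis. Equality in (d) occurs only if $\delta(L) = \alpha^L p_i$. Thus the claim holds for all $\ell > 0$, which
together with the conditions for equality concludes the proof of the lemma. ■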
Since every link in a path attenuates the pagerank flow by at least $\alpha$, we have the following
lemma, which will be useful in the proof of Lemma 9.

Lemma 19. For any two nodes $u$ and $v$, not necessarily distinct, and any set of nodes $S$ containing
$v$, $\rho(uv; S) \le \alpha^\ell$, where $\ell$ is the length of the shortest path in $P(uv; S)$. (Note, if $u = v$, then $\ell \ge 2$,
otherwise $\ell \ge 1$.)

Proof. We prove the lemma by double induction on $\ell$ and $L$, the length of the longest path in
$P(uv; S)$. If $L = \ell$, then $\rho(uv; S) \le \delta(\ell)/p_u$, and by Lemma 8 we have $\rho(uv; S) \le \alpha^\ell$.
Assume the claim true whenever $\ell \le k$ and $L \le K$ and consider $\ell \le k + 1$, $L \le K + 1$.

$$\begin{aligned}
\rho(uv;S) &= \frac{\alpha}{\mathrm{outdeg}(u)} \sum_{(u,w)\in E} \rho(wv;S) \\
&\overset{(a)}{\le} \frac{\alpha}{\mathrm{outdeg}(u)} \sum_{(u,w)\in E} \alpha^{k} \\
&\le \alpha^{k+1}.
\end{aligned}$$

(a) follows from the induction hypothesis because the shortest path length in $P(wv; S)$ is at most
$k$ and the longest path length is at most $L$. Therefore, the claim holds for all $\ell \ge 1$ and all $L \ge \ell$. ■



4.1.2 Proof of Lemma 9

Consider a collection of paths $P_t$ from $v_i$ to $t$ where $t$ is the terminal node for all the paths, and does
not appear as an intermediate node in any path. Let $P_t(S)$ be the collection of distinct prefixes.
For every path $q \in P_t$, let $s(q)$ denote the terminal node of its corresponding prefix in $P_t(S)$. Let
$S = \{s_1, \ldots, s_k\}$. We can partition the paths in $P_t$ into $k$ disjoint sets $P_t^1, \ldots, P_t^k$ according to the
terminal nodes of the prefixes, i.e., for every path $q \in P_t^j$, $s(q) = s_j$. Let $\Delta_{s_j}$ be the total (summed)
flow of pagerank to $s_j$ along the paths in $P_t^j$.

$$I(S \mid P_t(S)) = \sum_{i=1}^{k} \Delta_{s_i}.$$

Each path in $P_t^i$ contains a suffix path from $s_i$ to $t$ in which $t$ does not appear as an intermediate
node. Consider the fraction $\rho$ of $s_i$'s pagerank that flows along the distinct such suffixes to $t$. Since
these suffixes are a subset of the paths in $P(s_it;t)$, we have that $\rho \le \rho(s_it;t)$. $I(t \mid P_t^i)$ can now be
bounded as follows,

$$I(t \mid P_t^i) = \rho\,\Delta_{s_i} \le \rho(s_it;t)\,\Delta_{s_i}.$$

$I(t \mid P_t)$ is the sum of the $I(t \mid P_t^i)$'s, so we obtain

$$\begin{aligned}
I(t \mid P_t) &= \sum_{i=1}^{k} I(t \mid P_t^i) \\
&\le \sum_{i=1}^{k} \rho(s_it;t)\,\Delta_{s_i} \\
&\overset{(a)}{\le} \alpha \sum_{i=1}^{k} \Delta_{s_i} \\
&= \alpha\, I(S \mid P_t(S)),
\end{aligned}$$

where (a) follows from Lemma 19.

4.1.3 Proof of Lemma 10

Partition the set of cycles containing $v_0$ as initial and terminal node, but not as intermediate node,
into two disjoint sets $C_1$ and $C_2$. $C_1$ contains the cycles which do not contain $v_i$ and $C_2$ contains
all the cycles which also contain $v_i$. (Note $C_1 = P(v_0v_0; v_0v_i)$.) Let $P_{v_0,v_i} = P(v_0v_i; v_0v_i)$ and
$\rho_{v_0v_i} = \rho(v_0v_i; v_0v_i)$. (Note that $\rho_{v_0v_i} \le \alpha$.) Every path in $C_2$ is composed of a path in $P_{v_0,v_i}$
together with a path from $v_i$ to $v_0$ in which $v_0$ appears only as a terminal node (i.e. a path in
$P = P(v_iv_0; v_0)$). The fraction of $v_i$'s pagerank that flows to $v_0$ along paths in $P$ is by definition
$\Delta/p_i$. Thus, the fraction of $v_0$'s pagerank that flows along cycles in $C_2$ back to $v_0$ is $\rho_{v_0v_i}\Delta/p_i$. Let
$\gamma_{v_0} = \rho(v_0v_0; v_0v_i)$ be the fraction of $v_0$'s pagerank that flows along cycles in $C_1$ back to $v_0$. Therefore
the total fraction of $v_0$'s pagerank that flows back to $v_0$ along paths in $C_1 \cup C_2$ is $\gamma_{v_0} + \rho_{v_0v_i}\Delta/p_i$.
This fraction will be amplified again by the cycles in $C_1$ and $C_2$. Thus,

$$\Delta p_{v_0} = \Delta + \mathrm{amp}(\Delta),$$

where $\mathrm{amp}(x)$ satisfies

$$\mathrm{amp}(x) = \phi x + \mathrm{amp}(\phi x),$$

and $\phi = \phi(\Delta) = \gamma_{v_0} + \rho_{v_0v_i}\Delta/p_i < 1$. The unique solution to this equation (which can be obtained
by expanding $\mathrm{amp}(\phi x)$ repeatedly to obtain a geometric series) is

$$\mathrm{amp}(x) = \frac{\phi x}{1 - \phi}.$$

Substituting into the expression for $\Delta p_{v_0}$, we obtain

$$\Delta p_{v_0}(\Delta) = \frac{\Delta}{1 - \gamma_{v_0} - \rho_{v_0v_i}\Delta/p_i}.$$

To conclude, note that the right hand side is monotonically increasing in $\Delta$. ■
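The closed form for the amplified gain can be checked numerically by unrolling the recursion $\mathrm{amp}(x) = \phi x + \mathrm{amp}(\phi x)$ into its geometric series. The sketch below is only an illustration; the function names and the sample values of $\gamma_{v_0}$, $\rho_{v_0v_i}$, $p_i$ and $\Delta$ are arbitrary assumptions.

```python
def amp_series(x, phi, terms=500):
    """Unroll amp(x) = phi*x + amp(phi*x) into phi*x + phi^2*x + ... (truncated)."""
    total, term = 0.0, x
    for _ in range(terms):
        term *= phi
        total += term
    return total


def gain(delta, gamma, rho, p_i):
    """Closed form: Delta p_{v0}(Delta) = Delta / (1 - gamma - rho * Delta / p_i)."""
    phi = gamma + rho * delta / p_i
    if not 0.0 <= phi < 1.0:
        raise ValueError("the derivation requires 0 <= phi < 1")
    return delta / (1.0 - phi)


# Arbitrary sample values (assumptions, not taken from the paper).
gamma, rho, p_i = 0.2, 0.3, 0.5
delta = 0.1
phi = gamma + rho * delta / p_i            # 0.26
series_value = delta + amp_series(delta, phi)
closed_value = gain(delta, gamma, rho, p_i)
larger_value = gain(0.2, gamma, rho, p_i)  # a larger flow Delta gives a larger gain
```

With these values the truncated series and the closed form agree to within floating-point error, and increasing $\Delta$ from 0.1 to 0.2 increases the gain, in line with the monotonicity claimed above.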



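4.1.4 Proof of Theorem 13

Consider the attack by a single attacker $v_i$. We will show that the direct attack is best for $v_i$
independent of the rest of the graph, in particular what the other attackers do, from which the
theorem will follow. For the direct individual attack not to maximize the rank (and some other
attack $X$ to maximize it), there must be some vertex $u$ for which $p^I_{v_0} < p^I_u$ and $p^X_{v_0} > p^X_u$.

First consider the case when there are no paths from $v_0$ to $u$. Then, $p_{v_0} + \Delta p^I_{v_0} < p_u + \Delta p^I_u$ and
$p_{v_0} + \Delta p^X_{v_0} > p_u + \Delta p^X_u$. Since $\Delta p^I_u = 0$ (no paths from $v_0$ to $u$),

$$p_u - p_{v_0} > \Delta p^I_{v_0}, \qquad p_u - p_{v_0} < \Delta p^X_{v_0} - \Delta p^X_u,$$

which is a contradiction because $\Delta p^X_{v_0} \le \Delta p^I_{v_0}$ (Theorem 11), and $\Delta p^X_u \ge 0$.

Now consider the case when there are paths from $v_0$ to $u$. We introduce some definitions that
will simplify the notation:

$$\begin{aligned}
\rho_{v_0u} &= \rho(v_0u; v_iu), \\
\rho_{v_0v_i} &= \rho(v_0v_i; v_0v_i), \\
\rho_{uv_i} &= \rho(uv_i; v_iu), \\
\gamma_{v_0} &= \rho(v_0v_0; v_0v_i), \\
\gamma_u &= \rho(uu; v_iu).
\end{aligned}$$

($\gamma_{v_0}$, $\rho_{v_0v_i}$ are defined as in the proof of Lemma 10.) $\gamma_{v_0}$ and $\gamma_u$ are fractions that flow along cycles.
Let $\Delta_{v_0}$ and $\Delta_u$ be the pagerank flow from $v_i$ to $v_0$ and $u$ along the respective paths $P(v_iv_0; v_0)$
and $P(v_iu; u)$. Then,

$$\begin{aligned}
p^I_{v_0} &= p_{v_0} + \Delta p^I_{v_0}(\Delta^I_{v_0}), & p^I_u &= p_u + \Delta p^I_u(\Delta^I_u), \\
p^X_{v_0} &= p_{v_0} + \Delta p^X_{v_0}(\Delta^X_{v_0}), & p^X_u &= p_u + \Delta p^X_u(\Delta^X_u).
\end{aligned}$$

($I$ denotes the direct individual attack and $X$ the other attack.) In the direct attack $I$, the only
paths from $v_i$ to $u$ are through $v_0$. In the attack $X$, there may be paths from $v_i$ to $u$ that do not
pass through $v_0$. Therefore, we have

$$\Delta^I_u = \rho_{v_0u}\Delta^I_{v_0}, \qquad \Delta^X_u \ge \rho_{v_0u}\Delta^X_{v_0}.$$

As in the proof of Lemma 10, let

$$G(x;\gamma,\rho) = \frac{x}{1 - \gamma - \rho x/p_i}.$$

Then,

$$\begin{aligned}
\Delta p^I_{v_0}(\Delta^I_{v_0}) &= G(\Delta^I_{v_0}; \gamma_{v_0}, \rho_{v_0v_i}), \\
\Delta p^X_{v_0}(\Delta^X_{v_0}) &= G(\Delta^X_{v_0}; \gamma_{v_0}, \rho_{v_0v_i}),
\end{aligned}$$

and,

$$\begin{aligned}
\Delta p^I_u(\Delta^I_u) &= G(\Delta^I_u; \gamma_u, \rho_{uv_i}) \\
&= G(\rho_{v_0u}\Delta^I_{v_0}; \gamma_u, \rho_{uv_i}) \\
&= \rho_{v_0u}\,G(\Delta^I_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i}), \\
\Delta p^X_u(\Delta^X_u) &= G(\Delta^X_u; \gamma_u, \rho_{uv_i}) \\
&\overset{(a)}{\ge} G(\rho_{v_0u}\Delta^X_{v_0}; \gamma_u, \rho_{uv_i}) \\
&= \rho_{v_0u}\,G(\Delta^X_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i}).
\end{aligned}$$

(a) follows because $G$ is monotonic in $x$, and we have used the identity $G(\lambda x;\gamma,\rho) = \lambda G(x;\gamma,\lambda\rho)$.
$u$ is such that $p^I_u - p^I_{v_0} > 0$ and $p^X_u - p^X_{v_0} < 0$. Thus,

$$\begin{aligned}
p_u - p_{v_0} &> G(\Delta^I_{v_0}; \gamma_{v_0}, \rho_{v_0v_i}) - \rho_{v_0u}\,G(\Delta^I_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i}), \\
p_u - p_{v_0} &< G(\Delta^X_{v_0}; \gamma_{v_0}, \rho_{v_0v_i}) - \rho_{v_0u}\,G(\Delta^X_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i}).
\end{aligned}$$

Combining these two equations, we find that

$$F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_{v_0}, \rho_{v_0v_i}) > \rho_{v_0u}\,F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i}),$$

where

$$F(x_1,x_2;\gamma,\rho) = G(x_1;\gamma,\rho) - G(x_2;\gamma,\rho) = \frac{(x_1 - x_2)(1-\gamma)}{(1-\gamma-\rho x_1/p_i)(1-\gamma-\rho x_2/p_i)}.$$

Since $\Delta^X_{v_0} < \Delta^I_{v_0}$ (Theorem 11), we obtain

$$\frac{\rho_{v_0u}\,F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i})}{\Delta^X_{v_0} - \Delta^I_{v_0}} > \frac{F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_{v_0}, \rho_{v_0v_i})}{\Delta^X_{v_0} - \Delta^I_{v_0}}.$$

Let $\rho_{00} = \rho(v_0v_0; v_0v_iu)$, and let $\rho_{uu} = \rho(uu; v_0v_iu)$. Let $Q = (1-\rho_{00})/(1-\rho_{uu})$. We will need the
following lemmas to complete the proof. We will prove the lemmas after the proof of the theorem.

Lemma 20. $1 - \gamma_{v_0} = Q \cdot (1 - \gamma_u)$.

Lemma 21. $\rho_{v_0v_i} \ge Q\,\rho_{v_0u}\rho_{uv_i}$.

By Lemma 21 and the monotonicity of $F$ with respect to $\rho$, we have

$$\frac{\rho_{v_0u}\,F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_u, \rho_{v_0u}\rho_{uv_i})}{\Delta^X_{v_0} - \Delta^I_{v_0}} > \frac{F(\Delta^X_{v_0}, \Delta^I_{v_0}; \gamma_{v_0}, Q\rho_{v_0u}\rho_{uv_i})}{\Delta^X_{v_0} - \Delta^I_{v_0}},$$

or that,

$$\begin{aligned}
\frac{\rho_{v_0u}(1-\gamma_u)}{\left(1-\gamma_u-\rho_{v_0u}\rho_{uv_i}\frac{\Delta^X_{v_0}}{p_i}\right)\left(1-\gamma_u-\rho_{v_0u}\rho_{uv_i}\frac{\Delta^I_{v_0}}{p_i}\right)}
&> \frac{1-\gamma_{v_0}}{\left(1-\gamma_{v_0}-Q\rho_{v_0u}\rho_{uv_i}\frac{\Delta^X_{v_0}}{p_i}\right)\left(1-\gamma_{v_0}-Q\rho_{v_0u}\rho_{uv_i}\frac{\Delta^I_{v_0}}{p_i}\right)} \\
&\overset{(a)}{=} \frac{Q(1-\gamma_u)}{Q^2\left(1-\gamma_u-\rho_{v_0u}\rho_{uv_i}\frac{\Delta^X_{v_0}}{p_i}\right)\left(1-\gamma_u-\rho_{v_0u}\rho_{uv_i}\frac{\Delta^I_{v_0}}{p_i}\right)},
\end{aligned}$$

where (a) follows using Lemma 20. After some algebraic manipulations, we obtain

$$\rho_{v_0u} > \frac{1}{Q} = \frac{1-\rho_{uu}}{1-\rho_{00}}.$$

In any attack, $v_0$'s pagerank flows to $u$ with attenuation $\rho(v_0u; v_0u)$ amplified by $1/(1-\rho(uu; v_0))$.
Since $u$'s pagerank cannot be smaller than what flows from $v_0$, we have

$$\begin{aligned}
p_u &\ge p_{v_0}\,\frac{\rho(v_0u; v_0u)}{1-\rho(uu; v_0)} \\
&\overset{(a)}{\ge} p_{v_0}\,\frac{\rho(v_0u; v_0v_iu)}{1-\rho(uu; v_0v_iu)} \\
&\overset{(b)}{=} p_{v_0}\,\frac{(1-\rho(v_0v_0; v_0v_iu))\,\rho(v_0u; v_iu)}{1-\rho(uu; v_0v_iu)} \\
&\overset{(c)}{=} p_{v_0}\,\frac{(1-\rho_{00})\,\rho_{v_0u}}{1-\rho_{uu}} \\
&> p_{v_0}.
\end{aligned}$$

(a) follows from Lemma 17; (b) follows because, using Lemma 18,

$$\rho(v_0u; v_iu) = \frac{\rho(v_0u; v_0v_iu)}{1-\rho(v_0v_0; v_0v_iu)};$$

and (c) follows from the definitions of $\rho_{00}$, $\rho_{uu}$, $\rho_{v_0u}$. Thus, $p_u > p_{v_0}$ for any attack, in particular
for the attack $X$, which contradicts the fact that $p^X_{v_0} > p^X_u$. This contradiction implies that no such
vertex $u$ can exist, which concludes the proof of the theorem. ■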



4.1.5 Proof of Lemma 20

We use the same notation as in the proof of Theorem 13. Let $S = \{v_0, v_i, u\}$. $\gamma_{v_0} = \rho(v_0v_0; v_0v_i)$ is
the fraction of $v_0$'s rank that flows back to $v_0$ along paths in $P(v_0v_0; v_0v_i)$. The paths in $P(v_0v_0; v_0v_i)$
can be partitioned into paths that contain $u$ and paths that do not. The paths that contain $u$
are paths in $P(v_0u; S)$ concatenated with paths in $P(uu; S)$ concatenated with paths in $P(uv_0; S)$.
Therefore, using Lemma 18,

$$\gamma_{v_0} = \rho(v_0v_0; S) + \frac{\rho(v_0u;S)\,\rho(uv_0;S)}{1-\rho(uu;S)}.$$

Applying similar reasoning to $\gamma_u$, and using the definitions for $\rho_{00}$, $\rho_{uu}$, we obtain

$$\gamma_{v_0} = \rho_{00} + \frac{\rho(v_0u;S)\,\rho(uv_0;S)}{1-\rho_{uu}}, \qquad
\gamma_u = \rho_{uu} + \frac{\rho(v_0u;S)\,\rho(uv_0;S)}{1-\rho_{00}}.$$

Let $A = \rho(v_0u; S)\,\rho(uv_0; S)$. We find that

$$1 - \gamma_{v_0} = \frac{(1-\rho_{00})(1-\rho_{uu}) - A}{1-\rho_{uu}}, \qquad
1 - \gamma_u = \frac{(1-\rho_{00})(1-\rho_{uu}) - A}{1-\rho_{00}}.$$

It now follows that $(1 - \gamma_{v_0}) = Q(1 - \gamma_u)$. ■
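4.1.6 Proof of Lemma 21

We use the same notation as in the proof of Theorem 13. Let $S = \{v_0, v_i, u\}$. Then,

$$\begin{aligned}
\rho_{v_0v_i} &= \rho(v_0v_i;S) + \frac{\rho(v_0u;S)\,\rho(uv_i;S)}{1-\rho_{uu}}, \\
\rho_{v_0u} &= \frac{\rho(v_0u;S)}{1-\rho_{00}}, \\
\rho_{uv_i} &= \rho(uv_i;S) + \frac{\rho(uv_0;S)\,\rho(v_0v_i;S)}{1-\rho_{00}}.
\end{aligned}$$

Therefore, we find that

$$\begin{aligned}
Q\,\rho_{v_0u}\rho_{uv_i} &= \frac{\rho(v_0u;S)\,\rho(uv_i;S)}{1-\rho_{uu}} + \frac{\rho(v_0u;S)\,\rho(uv_0;S)\,\rho(v_0v_i;S)}{(1-\rho_{uu})(1-\rho_{00})} \\
&= \rho_{v_0v_i} - \rho(v_0v_i;S) + \frac{\rho(v_0u;S)\,\rho(uv_0;S)\,\rho(v_0v_i;S)}{(1-\rho_{uu})(1-\rho_{00})}.
\end{aligned}$$

After rearranging terms, we obtain

$$\begin{aligned}
\rho_{v_0v_i} - Q\,\rho_{v_0u}\rho_{uv_i} &= \rho(v_0v_i;S)\cdot\left(1 - \frac{\rho(v_0u;S)}{1-\rho_{uu}}\cdot\frac{\rho(uv_0;S)}{1-\rho_{00}}\right) \\
&\overset{(a)}{=} \rho(v_0v_i;S)\cdot\bigl(1 - \rho(v_0u; v_iu)\,\rho(uv_0; v_0v_i)\bigr).
\end{aligned}$$

(a) follows from Lemma 18. To conclude, note that $\rho(v_0u; v_iu)\,\rho(uv_0; v_0v_i) \le \alpha^2$ (Lemma 19), and
so the right hand side is $\ge 0$. ■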

4.2 The Optimal Disguised Attack 

For the optimal disguised attack, every path from an attacker to the victim must have length $\ge \ell$.
We only consider the case that such attacks are possible, in particular, $U_{\ell-1}$ is not empty.
Consider the graph with the edge set $E_u = E \cup (v_i,u)$. Let $P_u(vw; x_1, \ldots, x_k)$ and $\rho_u(vw; x_1, \ldots, x_k)$
be defined with respect to the edge set $E_u$ in exactly the same way that $P(vw; x_1, \ldots, x_k)$ and
$\rho(vw; x_1, \ldots, x_k)$ were defined in the previous section. Note that

$$V_i(u) = f_u(v_i; v_0) = \rho_u(v_iv_0; v_0), \qquad \rho_{\max}(v) = \max_{(v,w)\in E_u} \rho_u(wv_0; v_0).$$

From the forward equation, by replacing the summand by the largest term, we have,

Lemma 22. $\rho_u(vv_0; v_0) \le \alpha\rho_{\max}(v)$, with equality iff $\rho_u(w_1v_0; v_0) = \rho_u(w_2v_0; v_0)$ for all $w_1, w_2$
such that $(v, w_1), (v, w_2) \in E_u$.

Lemma 23. There is at least one vertex $w^* \in U_{\ell-1}$ with $\rho_u(uv_0; v_0) \le \rho_u(w^*v_0; v_0)$.

Proof. Consider $\rho_u(uv_0; v_0)$. We can assume that $\rho_u(uv_0; v_0) > 0$ (i.e., $d(u,v_0) < \infty$), and that
$d(u,v_0) > \ell - 1$, as otherwise there is nothing to prove. Choose $w_1$ in the 1-neighborhood of $u$
such that $\rho_u(w_1v_0; v_0) = \rho_{\max}(u)$. If there is more than one possible choice for $w_1$, select the
choice for which $d(w_1,v_0)$ is minimized, breaking any further ties arbitrarily. If $d(w_1,v_0) = \ell - 1$
then we stop, otherwise we define $w_2$ in a similar way to $w_1$: $w_2$ is a vertex in $N_1(w_1)$ such that
$\rho_u(w_2v_0; v_0) = \rho_{\max}(w_1)$. In general, $w_{i+1} = \arg\max_{(w_i,w_{i+1})\in E_u} \rho_u(w_{i+1}v_0; v_0)$, breaking ties according to
distance. By Lemma 22, since $\alpha \le 1$, for the sequence $u, w_1, w_2, \ldots$,

$$\rho_u(uv_0; v_0) \le \rho_u(w_1v_0; v_0) \le \rho_u(w_2v_0; v_0) \le \cdots.$$

Further, if $\rho_u(w_iv_0; v_0) = \rho_u(w_{i+1}v_0; v_0)$ (which can only happen if $\alpha = 1$ and all neighbors have
the same $\rho$), then, since the ties were broken by distance, $d(w_i,v_0) > d(w_{i+1},v_0)$. Thus, there are
no repetitions in the sequence $u, w_1, w_2, \ldots$. Since there is a path from $u$ to $v_0$ and $d(u,v_0) > \ell - 1$,
by the pigeonhole principle, we conclude that at least one vertex $w^*$ in this sequence is distance
$\ell - 1$ from $v_0$. ■

Note that equality in the Lemma can only occur if $\alpha = 1$, thus for $\alpha < 1$, it is strictly better to
be in $U_{\ell-1}$ than not. To prove Lemma 14, we will show that $V_i(w^*) \ge V_i(u)$.

4.2.1 Proof of Lemma 14

Suppose that the maximum is attained for a vertex $u$ with $d(u,v_0) > \ell - 1$. Let $w^* \in U_{\ell-1}$ be such
that $\rho_u(w^*v_0; v_0) \ge \rho_u(uv_0; v_0)$ (Lemma 23 guarantees the existence of such a vertex). We show
that $V_i(u) \le V_i(w^*)$. From the definitions of $V_i(u)$ and $V_i(w^*)$, we have that

$$\begin{aligned}
V_i(u) &= \rho_u(v_iv_0; v_0) \\
&= \alpha\rho_u(uv_0; v_0) \\
&\overset{(a)}{\le} \alpha\rho_u(w^*v_0; v_0); \\
V_i(w^*) &= \rho_{w^*}(v_iv_0; v_0) \\
&= \alpha\rho_{w^*}(w^*v_0; v_0).
\end{aligned}$$

(a) follows from the definition of $w^*$. By considering the paths which reuse $v_i$ and those which do
not, we have that

$$\begin{aligned}
\rho_u(w^*v_0; v_0) &= \rho_u(w^*v_0; v_iv_0) + \rho_u(w^*v_i; v_iv_0)\,\rho_u(v_iv_0; v_0) \\
&\overset{(a)}{=} \rho_u(w^*v_0; v_iv_0) + \alpha\rho_u(w^*v_i; v_iv_0)\,\rho_u(uv_0; v_0) \\
&\overset{(b)}{\le} \rho_u(w^*v_0; v_iv_0) + \alpha\rho_u(w^*v_i; v_iv_0)\,\rho_u(w^*v_0; v_0) \\
&\overset{(c)}{=} \rho_{w^*}(w^*v_0; v_iv_0) + \alpha\rho_{w^*}(w^*v_i; v_iv_0)\,\rho_u(w^*v_0; v_0); \\
\rho_{w^*}(w^*v_0; v_0) &= \rho_{w^*}(w^*v_0; v_iv_0) + \rho_{w^*}(w^*v_i; v_iv_0)\,\rho_{w^*}(v_iv_0; v_0) \\
&\overset{(d)}{=} \rho_{w^*}(w^*v_0; v_iv_0) + \alpha\rho_{w^*}(w^*v_i; v_iv_0)\,\rho_{w^*}(w^*v_0; v_0).
\end{aligned}$$

(a) follows because the only edge from $v_i$ in $E_u$ is $(v_i, u)$, and similarly for (d). (b) follows from
the definition of $w^*$. (c) follows because the only difference between $E_u$ and $E_{w^*}$ is that the edge
$(v_i,u)$ in $E_u$ is replaced by the edge $(v_i,w^*)$ in $E_{w^*}$. Therefore all paths that do not include $v_i$ as
an intermediate node are identical in $E_u$ and $E_{w^*}$, and so the corresponding $\rho$'s are equal. Since
$\rho_{w^*}(w^*v_0; v_iv_0) \ge 0$ and $\rho_{w^*}(w^*v_i; v_iv_0) < 1$, solving for $\rho_{w^*}(w^*v_0; v_0)$ and $\rho_u(w^*v_0; v_0)$, we get

$$\begin{aligned}
\rho_{w^*}(w^*v_0; v_0) &= \frac{\rho_{w^*}(w^*v_0; v_iv_0)}{1 - \alpha\rho_{w^*}(w^*v_i; v_iv_0)}, \\
\rho_u(w^*v_0; v_0) &\le \frac{\rho_{w^*}(w^*v_0; v_iv_0)}{1 - \alpha\rho_{w^*}(w^*v_i; v_iv_0)} = \rho_{w^*}(w^*v_0; v_0).
\end{aligned}$$

Thus, $V_i(u) \le V_i(w^*)$. ■



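4.2.2 Proof of Theorem 15

Let us consider an arbitrary attack $X$ in which $v_i$ has links to $w_1, w_2, \ldots, w_m$, where $d(w_j,v_0) \ge \ell - 1$
for $j \in \{1,\ldots,m\}$. Suppose that $v_i$ has $k_j$ links to $w_j$. Let $E_X = E \cup \{(v_i,w_j)_{k_j}\}_{j=1}^{m}$ be the augmented
edge set, where $(v_i,w_j)_{k_j}$ represents $k_j$ copies of $(v_i,w_j)$. Let $\rho_X(v_iv_0; v_0)$ be the fraction of $v_i$'s
rank that flows to $v_0$ along paths in $P_X(v_iv_0; v_0)$,

$$\Delta^X_{v_0} = \rho_X(v_iv_0; v_0)\,p_{v_i}.$$

For the single attack $I$, $v_i$ has only one link to $w^* \in U_{\ell-1}$, where $w^*$ is such that $V_i(w^*) \ge V_i(u)$
for all $u$ such that $d(u,v_0) \ge \ell - 1$.

$$\Delta^I_{v_0} = \rho_{w^*}(v_iv_0; v_0)\,p_{v_i}.$$

By Lemma 10, it suffices to show that $\Delta^I_{v_0} \ge \Delta^X_{v_0}$, i.e. that $\rho_{w^*}(v_iv_0; v_0) \ge \rho_X(v_iv_0; v_0)$. Let
$K = \sum_{j=1}^{m} k_j$. Using (3),

$$\begin{aligned}
\rho_X(v_iv_0; v_0) &= \frac{\alpha}{K}\sum_{j=1}^{m} k_j\,\rho_X(w_jv_0; v_0) \\
&\overset{(a)}{=} \frac{\alpha}{K}\sum_{j=1}^{m} k_j\left[\rho_X(w_jv_0; v_iv_0) + \rho_X(w_jv_i; v_iv_0)\,\rho_X(v_iv_0; v_0)\right] \\
&\overset{(b)}{=} \frac{\alpha}{K}\sum_{j=1}^{m} k_j\left[\rho_{w_j}(w_jv_0; v_iv_0) + \rho_{w_j}(w_jv_i; v_iv_0)\,\rho_X(v_iv_0; v_0)\right] \\
&\overset{(c)}{=} \frac{\alpha\sum_{j=1}^{m} k_j\,\rho_{w_j}(w_jv_0; v_iv_0)}{K - \alpha\sum_{j=1}^{m} k_j\,\rho_{w_j}(w_jv_i; v_iv_0)} \\
&\le \frac{\alpha\rho_{\hat w}(\hat wv_0; v_iv_0)}{1 - \alpha\rho_{\hat w}(\hat wv_i; v_iv_0)} \\
&\overset{(d)}{=} \rho_{\hat w}(v_iv_0; v_0),
\end{aligned}$$

where $\hat w = \arg\max_{w_j} \rho_{w_j}(w_jv_0; v_iv_0)$. (a) follows by partitioning the paths from $w_j$ to $v_0$ into those
that use $v_i$ and those that do not; (b) follows because the paths that do not use $v_i$ are identical
in both $E_X$ and $E_{w_j}$; (c) follows after solving for $\rho_X(v_iv_0; v_0)$; and (d) follows after solving for
$\rho_{w_j}(v_iv_0; v_0)$ in

$$\rho_{w_j}(v_iv_0; v_0) = \alpha\left[\rho_{w_j}(w_jv_0; v_iv_0) + \rho_{w_j}(w_jv_i; v_iv_0)\,\rho_{w_j}(v_iv_0; v_0)\right].$$

To conclude, note that by the definition of $w^*$, $\rho_{w^*}(v_iv_0; v_0) \ge \rho_{\hat w}(v_iv_0; v_0)$. ■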

4.2.3 Proof of Theorem 16

Let $X$ be an optimal attack in which each attacker's only link is to a node in $U_{\ell-1}$ (not necessarily
the same node for each attacker). By Theorem 15 such an optimal attack exists. Suppose that
attacker $v_i$ points to node $w_i \in U_{\ell-1}$. Then,

$$\rho(v_iv_0; v_0) = \alpha\rho(w_iv_0; v_0).$$

Let $A = \{v_0, v_1, \ldots, v_K\}$ denote the set containing the attackers and the victim. We use the
notation $\rho_{uv} = \rho(uv; A)$ to be the fraction of rank flowing from $u$ to $v$ along paths that do not
contain a node of $A$ as an intermediate node. Let $v^*$ be the attacker satisfying

$$\rho(v^*v_0; v_0) \ge \rho(v_iv_0; v_0),$$

where $v^*$ and $v_i$ are attackers, and denote the node that $v^*$ points to as $w^*$. Then, $\rho(w^*v_0; v_0) \ge
\rho(w_iv_0; v_0)$ for all $i \in \{1,\ldots,K\}$.

We now consider the attack $\hat X$ in which every attacker only points to $w^*$. We will show that for
every node $v$, $\hat\rho(vv_0; v_0) \ge \rho(vv_0; v_0)$, where $\hat\rho$ (resp. $\rho$) is the fraction of $v$'s rank that propagates
to $v_0$ under attack $\hat X$ (resp. $X$). First consider $w^*$. We have

$$\begin{aligned}
\rho(w^*v_0; v_0) &= \rho_{w^*v_0} + \alpha\sum_{i=1}^{K} \rho_{w^*v_i}\,\rho(w_iv_0; v_0) \\
&\overset{(a)}{\le} \rho_{w^*v_0} + \alpha\sum_{i=1}^{K} \rho_{w^*v_i}\,\rho(w^*v_0; v_0) \\
&\overset{(b)}{\le} \frac{\rho_{w^*v_0}}{1 - \alpha\sum_{i=1}^{K} \rho_{w^*v_i}}.
\end{aligned}$$

(a) follows because $\rho(w_iv_0; v_0) \le \rho(w^*v_0; v_0)$, and (b) after solving for $\rho(w^*v_0; v_0)$. Paths that do
not pass through an attacker are identical in the attacks $X$ and $\hat X$. Thus, $\hat\rho_{uv} = \rho_{uv}$ and so

$$\begin{aligned}
\hat\rho(w^*v_0; v_0) &\overset{(a)}{=} \rho_{w^*v_0} + \alpha\sum_{i=1}^{K} \rho_{w^*v_i}\,\hat\rho(w^*v_0; v_0) \\
&\overset{(b)}{=} \frac{\rho_{w^*v_0}}{1 - \alpha\sum_{i=1}^{K} \rho_{w^*v_i}}.
\end{aligned}$$

(a) follows because every attacker in $\hat X$ points to $w^*$ and (b) after solving for $\hat\rho(w^*v_0; v_0)$. Thus we
conclude that $\rho(w^*v_0; v_0) \le \hat\rho(w^*v_0; v_0)$. Now consider an arbitrary node $v$.

$$\begin{aligned}
\rho(vv_0; v_0) &= \rho_{vv_0} + \sum_{i=1}^{K} \rho_{vv_i}\,\rho(v_iv_0; v_0) \\
&= \rho_{vv_0} + \alpha\sum_{i=1}^{K} \rho_{vv_i}\,\rho(w_iv_0; v_0) \\
&\overset{(a)}{\le} \rho_{vv_0} + \alpha\sum_{i=1}^{K} \rho_{vv_i}\,\rho(w^*v_0; v_0) \\
&\overset{(b)}{\le} \rho_{vv_0} + \alpha\sum_{i=1}^{K} \rho_{vv_i}\,\hat\rho(w^*v_0; v_0) \\
&= \hat\rho(vv_0; v_0).
\end{aligned}$$

(a) follows by definition of $w^*$, and (b) because $\rho(w^*v_0; v_0) \le \hat\rho(w^*v_0; v_0)$; thus the magnitude of $\hat X$
is at least as large as the magnitude of $X$. ■



5 Experimental results 

In this section, we give some preliminary experimental results that quantify the effectiveness of link
bombs in various environments. There are four main degrees of freedom we explore: the nature of
the graph, including its connectivity or edge density; the prominence (pagerank) of the attackers;
the prominence of the victim; and the value of $\alpha$.

We ran our experiments on three types of graphs: Random is an Erdős-Rényi type ($G(n,p)$)
random graph with edge probability $p$; BA (Barabási-Albert) is a preferential-attachment random
graph with 5 outgoing edges per vertex [3] (such graphs are known to have power-law in-degree
distributions, and since we add the vertices sequentially, there are no cycles); MWDTA is a
modified "Winner's don't take all" random graph in which every node has at least one out-going 
edge p2]. (Such graphs are known to model certain characteristics of the world wide web graph 
such as power-law in and out-degree distributions.). The main difference between MWDTA and 
BA random graphs is that in MWDTA, a larger number of nodes will have significant in-degree, 
whereas in BA a few nodes have very large in-degrees. In order to make fair comparisons, we 
normalize graphs from different random graph models (Random, BA or MWDTA) to have the 
same expected number of edges. 
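As an illustration of the Random model, the sketch below generates a directed $G(n,p)$ graph. The choice $p = 5/(n-1)$, which gives an expected out-degree of 5 and so matches the 5 outgoing edges per vertex of the BA graphs, is an assumption made for this example; the function names, the seed handling and the dictionary-of-lists format are likewise hypothetical.

```python
import random


def random_digraph(n, p, seed=None):
    """Directed G(n, p): each ordered pair (u, w), u != w, is an edge with probability p."""
    rng = random.Random(seed)
    return {u: [w for w in range(n) if w != u and rng.random() < p]
            for u in range(n)}


def edge_count(out_links):
    """Total number of directed edges in the graph."""
    return sum(len(succ) for succ in out_links.values())


n = 200
graph = random_digraph(n, 5 / (n - 1), seed=1)
expected_edges = n * (n - 1) * (5 / (n - 1))  # n * 5 = 1000 expected edges
```

Fixing the expected number of edges across models in this way is what allows attack magnitudes on Random, BA and MWDTA graphs to be compared on an equal footing.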

First, we generate a random graph with 1,000 nodes, and randomly select 10 attackers and a 
victim. We then remove outgoing edges from the attackers and perform a pagerank computation, 
obtaining: 

$p_0$: the pagerank of the victim;
$p_A$: the average pagerank of the attackers;
$f_p(p)$: the pagerank distribution in the graph;
$\sigma_p$: the std. dev. of the pagerank distribution.



We only show results for two of the attacks described in Section 3.1: the optimal direct individual
attack $I$, and the cycle attack $C$ (the results for other suboptimal attacks are similar). Each attack
is repeated a number of times on randomly generated graphs to increase the statistical significance
of the results. We use the following measures of success for attack $X$:

$G(X)$ = Gain = $\Delta p_0^X / p_0$;
$\hat G(X)$ = Normalized Gain = $\Delta p_0^X / \sigma_p$;
$D(X)$ = Discrepancy Factor = $G(I) / G(X)$;
$\hat D(X)$ = Normalized Discrepancy = $\hat G(I) - \hat G(X)$.
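The success measures can be computed as in the following sketch. It assumes the readings Gain $= \Delta p_0^X/p_0$, Normalized Gain $= \Delta p_0^X/\sigma_p$, Discrepancy Factor $= G(I)/G(X)$ and Normalized Discrepancy $= \hat G(I) - \hat G(X)$; the function names and the sample numbers are hypothetical and are not experimental results from the paper.

```python
def gain(delta_p0, p0):
    """G(X): increase in the victim's pagerank relative to its original pagerank."""
    return delta_p0 / p0


def normalized_gain(delta_p0, sigma_p):
    """Increase in the victim's pagerank in units of the pagerank std. dev."""
    return delta_p0 / sigma_p


def discrepancy_factor(delta_individual, delta_x, p0):
    """D(X) = G(I) / G(X): how much better the direct individual attack I is."""
    return gain(delta_individual, p0) / gain(delta_x, p0)


def normalized_discrepancy(delta_individual, delta_x, sigma_p):
    """Difference of the normalized gains of attacks I and X."""
    return normalized_gain(delta_individual, sigma_p) - normalized_gain(delta_x, sigma_p)


# Made-up numbers purely for illustration.
p0, sigma_p = 0.001, 0.0005
delta_I, delta_C = 0.004, 0.002
d_factor = discrepancy_factor(delta_I, delta_C, p0)
d_norm = normalized_discrepancy(delta_I, delta_C, sigma_p)
```

With these illustrative inputs the individual attack gains four times the victim's original pagerank, the cycle attack gains twice, and so the discrepancy factor is 2.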

The pagerank distribution $f_p(p)$ generally affects the effectiveness of an attack. Figure 1 shows
pagerank distributions for the various random graphs. As can be seen, Random has a (near)
Normal distribution, compared with BA and MWDTA which have power-law type distributions
in which MWDTA appears to have a slightly fatter tail than BA.

Some detailed results on the effectiveness of the attacks are shown in Figure 2: (a) shows how
connectivity (number of edges) in Random graphs with different $p$ affects the attack; (b) shows
different graph types; (a) and (b) show the dependence on the prominence of the attackers, and (c)
shows the dependence on the prominence of the victim; (f) shows the dependence on $\alpha$. Figure 3
shows some results for the rank (as opposed to the pagerank). We give a summary of the results
below.



Higher Density: All attacks decrease in magnitude (new edges have little additional effect when
the graph is already dense).

Graph type: Prominence of attackers has (by far) the largest impact in Random graphs, as
compared to BA and MWDTA. (Pageranks in Random graphs are "concentrated" around the
mean, so any bias in the victim's pagerank results in it becoming extreme. This is less so for
BA and even less so for MWDTA.)

Higher Prominence of Attackers: Stronger attack.

Higher Prominence of Victim: Attacks become less effective and $D(C)$ decreases (diminishing
returns).

Lower $\alpha$: $D(C)$ increases (it is more costly to divert from the individual attack).

Rank: For random graphs, an attack usually results in a top ranking for the victim, which is not
usually the case for BA and MWDTA graphs.



Figure 1: Pagerank distributions of different graphs.

Figure 2: Experimental results on pagerank. (a) Graphs with different edge densities. (b) Attacks in different graph types. (c) Dependence on victim's pagerank.

Figure 3: Experimental results on rank. Horizontal axis: average rank range of the attacking group. Legend entries include Individual, Random(5/(N-1)); Cycle, BA; Individual, BA; Cycle, MWDTA; Individual, MWDTA.

6 Discussion

We have shown that the best attack is the direct individual attack, in particular: any organized
structure among the attackers reduces the impact of the attack; links that cycle back to attackers
in an attempt to boost their pageranks are detrimental. The discrepancy between the optimal
individual attack and suboptimal attacks can strongly depend on the graph type through the initial
pagerank distribution. Our results indicate conditions that offer resistance to rank manipulation:
dense, power-law type graphs in which victims already have high rank, attackers have low rank
and $\alpha$ is small. Our analysis has been focused on increasing a page's rank (pagerank manipulation)
in the entire graph, i.e., the victim's rank is increased for every query. The underlying model is
that the query identifies a set of nodes (based on text and anchor text), which defines an induced 
subgraph of the original graph. However, the nodes are ranked according to pagerank in the original 
graph. This model has the feature that pageranks do not need to be recomputed for the specific 
query. An alternative approach is to order the nodes with respect to the pageranks in the induced 
subgraph (hence these pageranks would need to be recomputed for every query). Such a model 
would mean that one attempts to boost the pagerank with respect to a specific query and not 
others. Our analysis does not apply to this model, and it is no longer true that the optimal attack 
is the direct individual attack. The following example (with a single attacker) illustrates: 





[Figure: (a) Original graph. (b) Direct attack. (c) Indirect attack.]



In (a) we show the original graph, where X will be the query text and the attacker wants to boost
the rank of $v_0$ with respect to X. In (b) we show the subgraph induced by the direct attack, where
the attacker places X in its page as well as in the anchor text of the link. In the resulting induced
subgraph, the rank of $v_0$ is not the highest. The benefit of the non-direct attack in (c) is that other
nodes that point to $v_0$ get included into the induced subgraph. Thus while the flow of rank from
$v_1$ to $v_0$ is decreased, this is more than compensated for by the additional rank contribution from
the newly included nodes. A better attack would arise if $v_1$ added another link to $v_0$. In fact, for
any attack in which $v_1$ has $k$ links to $v_0$, a strictly better attack with $k + 1$ links is possible. In this
example, there is no optimal attack. In general, we can formulate this notion by saying that the
attacker should add the minimum number of links to all nodes with paths to the victim which do
not contain the query text, and hence would not be included in the subgraph. The attackers should
then place as many parallel direct links to $v_0$ as is feasible. The end effect is to include all nodes
with paths to the victim with a minimum diversion of pagerank. Of course, such a huge attack is
not very practical, and an interesting question is to consider the optimal attack under this model
when each attacker has a fixed budget of links.

The PageRank algorithm favors attacks from groups that are not well connected, which makes it
harder to detect the attack, and accountability in such an attack formation becomes an issue: who
is responsible for the attack? Different variations of the PageRank algorithm may suffer a similar
fate if they propagate the pagerank in a similar way (for example Topic-Sensitive PageRank [15],
provided that the attacking group is considered relevant to the query). In order to avoid such a
fate (a dilemma faced by any ranking method open to manipulation by small groups), either one
must change the ranking function or somehow exclude the attacking group from the search engine's
database. While such an approach is a reasonable way to deal with private companies attempting
to manipulate rankings based on their own views, it is not very democracy-friendly to arbitrarily
remove certain pages from a search engine.

As discussed in [12] . the PageRank algorithm makes certain assumptions about the user navi- 
gation patterns and the web structure that may not apply to the Web anymore. [12] considers the 
effect of dangling nodes in the pagerank computation and provides methods to adjust for them. 
They also point out that users will rarely (if ever) navigate to one of several billion pages uniformly 
- they may not even know that these pages exist. In fact, users generally start from known sites 
and navigate from there. Hence, random navigation is more likely to bring them to one of these 
"anchor" sites. The HostRank algorithm [12] uses this assumption to choose a set of anchor sites, 
and they show that such an approach is more resistant to attacks. Trustrank algorithm 



uses a 
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set of trusted pages to bias the random jump probability. An interesting problem would be to check 
whether the selection algorithm for trusted pages can be manipulated (if it is not fully manual). 
For example, pages can exhibit trustworthy behavior to gain trust and then sell this influence for 
spam links. It would be interesting to study the sensitivity of the algorithm to various types of 
attacks. 
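
The biased random jump described above can be illustrated with a small sketch. It assumes a dictionary-based graph, a damping factor of 0.85, and a hand-picked trusted set; the function name `pagerank` and the toy graph are hypothetical and are not taken from [14].

```python
def pagerank(out_links, teleport, alpha=0.85, iters=200):
    """Power iteration for PageRank with an arbitrary random-jump distribution.

    out_links maps each node to the list of nodes it links to.
    teleport maps each node to its random-jump probability (sums to 1).
    Assumption: dangling nodes jump according to `teleport` as well.
    """
    nodes = list(out_links)
    p = dict(teleport)
    for _ in range(iters):
        # Mass sitting on nodes without outgoing links is re-injected via the jump.
        dangling = sum(p[u] for u in nodes if not out_links[u])
        nxt = {u: (1 - alpha + alpha * dangling) * teleport[u] for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                nxt[v] += alpha * p[u] / len(out_links[u])
        p = nxt
    return p


# Toy graph (hypothetical): 0 and 1 link to each other, as do 2 and 3.
graph = {0: [1], 1: [0], 2: [3], 3: [2]}

# Uniform random jump, as in the basic PageRank model.
uniform = pagerank(graph, {u: 0.25 for u in graph})
# Trusted set {0}: every random jump lands on node 0.
trusted = pagerank(graph, {0: 1.0, 1: 0.0, 2: 0.0, 3: 0.0})
```

In this toy case the pair 2, 3 is unreachable from the trusted node and receives no score under the biased jump, whereas under the uniform jump every node scores 0.25. The open question raised above is what happens when an attacker can influence which nodes enter the trusted set.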

A related issue is that of navigation along links from a site. One is more likely to trust a link on 
a highly ranked page, and one is more likely to follow a link to a highly ranked page. For example, 
it might be much more probable to follow one of the links from a search engine or a news Web site 
than a regular web page. The probability to navigate from a page in the PageRank algorithm is 
independent of a page's rank, and the link one selects to navigate is random. A plausible alternative 
is that the probability to navigate from a page should be proportional to the page's pagerank, and 
the probability to use a particular outgoing link is proportional to the pagerank of the destination 
page. Such a navigation model would lead to an equation (analogous to ([!])) of the form 



More effort could be spent on how the transition probabilities generally affect the pageranks and 
their manipulability. [12] discusses such issues for nodes with unknown outgoing links and [21]
uses the amount of traffic flow through the nodes to model the transition probabilities. It would 
be interesting to see what the optimal attack with such ranking algorithms is. In short, objective 
methods for the selection of the anchor sites or more plausible navigation models deserve closer
examination. One must also bear in mind (see for example [12]) that the computational complexity 
of the algorithm is also an important practical consideration for any ranking algorithm. 
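
One way to experiment with the rank-dependent link choice suggested above is the following sketch. It is an assumption made for illustration, not the equation of this paper: it covers only the second half of the proposed model (the link to a destination is chosen with probability proportional to the destination's current score), keeps a uniform random jump, and requires every node to have at least one outgoing link. The name `rank_iteration` and the toy graph are hypothetical.

```python
def rank_iteration(out_links, alpha=0.85, iters=200, weight_by_rank=False):
    """Fixed-point iteration for a PageRank-like score.

    With weight_by_rank=False each outgoing link is chosen uniformly, as in
    PageRank. With weight_by_rank=True the link to j is chosen with
    probability proportional to the current score of j (an assumption made
    for this sketch). Every node must have at least one outgoing link.
    """
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) / n for u in nodes}
        for u in nodes:
            targets = out_links[u]
            if weight_by_rank:
                # Scores stay positive because of the random jump term.
                total = sum(p[v] for v in targets)
                shares = [p[v] / total for v in targets]
            else:
                shares = [1.0 / len(targets)] * len(targets)
            for v, share in zip(targets, shares):
                nxt[v] += alpha * p[u] * share
        p = nxt
    return p


# Toy graph (hypothetical): node 0 chooses between 1 and 2; node 3 also links to 1.
graph = {0: [1, 2], 1: [0], 2: [0], 3: [1]}
uniform_choice = rank_iteration(graph)
weighted_choice = rank_iteration(graph, weight_by_rank=True)
```

In this toy graph the weighted choice shifts score from node 2 to node 1, which already has an extra incoming link, while the score of node 0 is unchanged. Note that the weighted iteration is nonlinear, so convergence is not guaranteed in general; a fixed number of iterations is used here only to keep the sketch short.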

Other factors, which we do not study here, might be significant to the success of an attack. 
[11] argues that anchor text pointing to a page gives information regarding the subject matter of
that page, and relationships between different pages. For example, Google may consider both the 
pagerank and the frequency of keywords in links pointing to a page when computing the score of 
the page. Google bombs in the past used the same keywords when pointing to the attacked page, 
i.e., the bombing links were correlated in that they all had the same keywords, whereas in general, 
links pointing to a website would not display such a correlation. If some linear combination of 
these two factors is then used in the final score, it will favor attacks over the natural Web behavior. 
If some small group of sites uses a specific keyword to point to a victim, it is unlikely that this
group's sites are unrelated, and one could (for example) add pseudo-links among these sites, since
the expectation would be that they participate in some group structure. As our results show, these
pseudo-links will reduce the magnitude of the attack. One could go so far as to say that if after the 
addition of such pseudo-links in the graph, the pagerank distribution does not change significantly, 
then the ranking algorithm should be more resistant to manipulation. 
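
The effect of pseudo-links can be checked on a small example. The sketch below is only an illustration under stated assumptions: a six-node toy graph, a damping factor of 0.85, a uniform random jump, and the L1 distance as one possible measure of how much the pagerank distribution changes. The names `pagerank`, `VICTIM`, `ATTACKERS`, and `l1_change` are hypothetical.

```python
def pagerank(out_links, alpha=0.85, iters=200):
    """Power iteration with a uniform random jump; assumes no dangling nodes."""
    nodes = list(out_links)
    n = len(nodes)
    p = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        nxt = {u: (1 - alpha) / n for u in nodes}
        for u in nodes:
            for v in out_links[u]:
                nxt[v] += alpha * p[u] / len(out_links[u])
        p = nxt
    return p


VICTIM = 0
ATTACKERS = [1, 2, 3]

# Hypothetical background: the victim links to 4, and 4 and 5 link to each other.
background = {0: [4], 4: [5], 5: [4]}

# Direct attack: each attacker links only to the victim.
direct = dict(background)
direct.update({a: [VICTIM] for a in ATTACKERS})

# Same attack after adding pseudo-links among the attackers.
pseudo = dict(background)
pseudo.update({a: [VICTIM] + [b for b in ATTACKERS if b != a] for a in ATTACKERS})

p_direct = pagerank(direct)
p_pseudo = pagerank(pseudo)

# L1 distance between the two distributions: one possible (assumed) measure
# of how much the pseudo-links change the ranking scores.
l1_change = sum(abs(p_direct[u] - p_pseudo[u]) for u in p_direct)
```

Here the victim's score is about 0.089 under the direct attack and about 0.074 once the attackers also link to each other, consistent with the statement that pseudo-links reduce the magnitude of the attack. A small `l1_change` on a real graph would suggest, in the sense discussed above, that the ranking is less sensitive to this kind of manipulation.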

The analysis of the optimal attack structure provides a new tool for looking at resistance to link 
manipulation. Such metrics and an understanding of optimal attack formations for other algorithms 
should be fruitful directions for future work. 

Acknowledgements 

We are grateful to Mark Goldberg for his initial feedback on this work. 

This research is continuing through participation in the Network Science Collaborative Technol- 
ogy Alliance sponsored by the U.S. Army Research Laboratory under Agreement Number W911NF- 
09-2-0053. The views and conclusions contained in this document are those of the author(s) and 
should not be interpreted as representing the official policies, either expressed or implied, of the 
U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to
reproduce and distribute reprints for Government purposes notwithstanding any copyright notation
hereon. 

References 

[1] S. Adali, T. Liu, and M. Magdon-Ismail. Optimal link bombs are uncoordinated. In First Int.
Wkshp. on Adversarial Information Retrieval on the Web (AIRWeb) at the 14th International
WWW Conf., Chiba, Japan, 10-14 May 2005. (electronic proceedings).

[2] R. Baeza-Yates, C. Castillo, and V. López. Pagerank increase under different collusion
topologies. In Proceedings of 1st International Workshop on Adversarial Information Retrieval on
the Web, 2005.

[3] A.-L. Barabási and R. Albert. The emergence of scaling in random networks. Science,
286(5439):509-512, 1999.

[4] M. Bianchini, M. Gori, and F. Scarselli. Inside pagerank. ACM Transactions on Internet 
Technology, 5:92-128, 2005. 

[5] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Computer 
Networks and ISDN Systems (special issue of 7th International WWW Conference), pages 107- 
117, Brisbane, Australia, 1998. 

[6] J. Caverlee, S. Webb, L. Liu, and W. B. Rouse. A parameterized approach to spam-resilient 
link analysis of the web. IEEE Transactions on Parallel and Distributed Systems, 20:1422- 
1438, 2009. 

[7] A. Cheng and E. Friedman. Manipulability of pagerank under sybil strategies. In Proceedings 
of the First Workshop on the Economics of Networked Systems, 2006. 

[8] S. Chien, C. Dwork, R. Kumar, D. Simon, and D. Sivakumar. Towards exploiting link evolu- 
tion. In Workshop on Algorithms for the Web, 2002. 

[9] I. Drost and T. Scheffer. Thwarting the nigritude ultramarine: Learning to identify link spam. 
In ECML 2005, LNAI 3120, pages 96-107, 2005. 

[10] Y. Du, Y. Shi, and X. Zhao. Using spam farm to boost pagerank. In Proceedings of Third 
International Workshop on Adversarial Information Retrieval on the Web, 2007. 

[11] N. Eiron and K. S. McCurley. Analysis of anchor text for web search. In 26th International
ACM SIGIR Conference, pages 459-460, Toronto, Canada, 2003. ACM.

[12] N. Eiron, K. S. McCurley, and J. A. Tomlin. Ranking the web frontier. In 13th International
WWW Conference, pages 309-318, New York, NY, 2004. ACM.

[13] Z. Gyongyi and H. Garcia-Molina. Link spam alliances. In Proceedings of the 31st International 
Conference on Very Large Data Bases, 2005. 

[14] Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with trustrank. In 
Proceedings of the 30th International Conference on Very Large Data Bases, 2004. 

[15] T. H. Haveliwala. Topic-sensitive pagerank. In 11th International WWW Conference, Hon- 
olulu, Hawaii, 2002. ACM. 

[16] T. McNichol. Engineering google results to make a point. NY Times, Jan 2004. 






[17] A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In 24th
International SIGIR Conference, pages 258-266, New Orleans, Louisiana, 2001. ACM.

[18] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing 
order to the web. Technical report, Stanford University Database Group, 1998. 

[19] D. M. Pennock, G. W. Flake, S. Lawrence, E. J. Glover, and C. L. Giles. Winners don't
take all: Characterizing the competition for links on the web. Proceedings of the National
Academy of Sciences, 99(8):5207-5211, 2002.

[20] D. Sullivan. Google's (and inktomi's) miserable failure, Jan 2004.
http://searchenginewatch.com/sereport/.

[21] J. Tomlin. A new paradigm for ranking pages on the world wide web. In 12th International
WWW Conference, pages 350-355, Budapest, Hungary, 2003. ACM.

[22] H. Zhang, A. Goel, R. Govindan, K. Mason, and B. V. Roy. Making eigenvector-based repu- 
tation systems robust to collusion. In Workshop on Algorithms and Models for the Web Graph 
(WAW), pages 92-104, San Diego, CA, 2004. Springer. 

[23] H. Zhang, A. Goel, R. Govindan, K. Mason, and B. V. Roy. Making eigenvector-based rep- 
utation systems robust to collusion. In Proceedings of the Third Workshop on Web Graphs 
(WAW), pages 92-104, 2004. 






[Figure: Random and Cycle attack, group size 10. Three panels comparing cycle attacks and random
attacks, plotted against (a) the average pagerank range of the attacked page, (b) the average rank
range of the attacking group, and (c) the average rank range of the attacked page.]



